Task-Agnostic Meta-Learning for Few-Shot Learning

An overview of the paper “Task-Agnostic Meta-Learning for Few-Shot Learning”. The authors present a task-agnostic meta-learning (TAML) algorithm built on top of MAML, and the formulation can be extended to other meta-learning algorithms with little effort. All images and tables in this post are from their paper.

Introduction

Typically, a meta-learner is trained on a variety of tasks in the hope that it generalizes to new tasks. However, the generalizability of a meta-learner to new tasks can be fragile when it is over-trained on existing tasks during the meta-training phase. In other words, the initial model of a meta-learner can be too biased towards existing tasks to adapt to a new task, especially when only very few examples are available to update the model. The authors propose TAML to avoid such a biased meta-learner. Specifically, they present an entropy-based approach that meta-learns an unbiased initial model with the largest uncertainty over the output labels, preventing it from over-performing on particular tasks. This approach, however, requires discrete outputs from the model, which limits it to classification tasks. Alternatively, a more general inequality-minimization TAML is presented that directly minimizes the inequality of the initial losses across tasks, and thus applies wherever a suitable loss can be defined. This makes the paradigm more widely applicable, extending it to domains such as regression and reinforcement learning.

Task Agnostic Meta-Learning

The problem with current meta-learning approaches is that the initial model or learner can become biased towards some tasks during the meta-training phase, particularly when the tasks encountered at test time differ from those seen during training. In this case, the authors wish to keep the initial model from over-performing on some tasks. Moreover, an over-performing initial model could also prevent the meta-learner from learning a better update rule with consistent performance across tasks.

Entropy-Maximization/Reduction TAML

To prevent the initial model from over-performing on a task, we prefer that it make a random guess over the predicted labels with equal probability, so that it is not biased towards the task. This can be expressed as a maximum-entropy prior over $\theta$: the initial model should have a large entropy over the predicted labels. The entropy for task $\mathcal{T}_i$ is computed by sampling $x_i$ from $P_{\mathcal{T}_i}(x)$ over its output probabilities over the $N$ predicted labels:

$$\mathcal{H}_{\mathcal{T}_i}(\theta) = -\mathbb{E}_{x_i \sim P_{\mathcal{T}_i}(x)} \sum_{n=1}^{N} \hat{y}_{i,n} \log \hat{y}_{i,n}$$

where $[\hat{y}_{i,1}, \ldots, \hat{y}_{i,N}] = f_\theta(x_i)$ are the predictions by $f_\theta$, which are often the output of a softmax layer in a classification task.
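
As a minimal sketch, the entropy term can be estimated by averaging the Shannon entropy of the model's softmax outputs over the samples drawn from a task. The PyTorch function below is illustrative and not from the authors' code:

```python
import torch

def prediction_entropy(probs: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of the task entropy H_{T_i}(theta).

    probs: (num_samples, N) softmax outputs for samples x_i drawn
    from one task T_i. Returns the mean Shannon entropy over samples.
    """
    eps = 1e-8  # guard against log(0)
    per_sample = -(probs * (probs + eps).log()).sum(dim=-1)
    return per_sample.mean()
```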

Alternatively, one can not only maximize the entropy $\mathcal{H}_{\mathcal{T}_i}(\theta)$ before the update of the initial model's parameter, but also minimize the entropy $\mathcal{H}_{\mathcal{T}_i}(\theta_i)$ after the update. So overall, we maximize the entropy reduction for each task $\mathcal{T}_i$ as $\mathcal{H}_{\mathcal{T}_i}(\theta) - \mathcal{H}_{\mathcal{T}_i}(\theta_i)$. Minimizing $\mathcal{H}_{\mathcal{T}_i}(\theta_i)$ means that the model can become more certain about the labels with a higher confidence after updating the parameter from $\theta$ to $\theta_i$. This entropy term can be combined with the typical meta-training objective as a regularizer to find the optimal $\theta$:

$$\min_\theta \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left[ \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i}) + \lambda \left\{ -\mathcal{H}_{\mathcal{T}_i}(\theta) + \mathcal{H}_{\mathcal{T}_i}(\theta_i) \right\} \right]$$

where $\theta_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$ is the parameter after a gradient update on task $\mathcal{T}_i$, and $\lambda$ balances the meta-training loss against the entropy term.
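
To make the objective concrete, here is a sketch of one meta-batch of the entropy-reduction objective on top of a MAML-style inner update. The functional forward pass `model(x, params)`, the attribute `model.params`, and the hyperparameters `inner_lr` and `lam` are assumptions for illustration, not the authors' implementation; `prediction_entropy` is the helper defined above.

```python
import torch
import torch.nn.functional as F

def taml_entropy_meta_loss(model, tasks, inner_lr=0.01, lam=0.1):
    """Entropy-reduction TAML meta-objective for one batch of tasks (sketch).

    Each task is a tuple (x_support, y_support, x_query, y_query).
    Assumes model(x, params) is a functional forward pass returning logits
    and model.params is the list of initial parameters theta.
    """
    meta_loss = 0.0
    for x_s, y_s, x_q, y_q in tasks:
        # Entropy of the initial model theta on this task's samples.
        h_theta = prediction_entropy(model(x_s, model.params).softmax(dim=-1))

        # Inner MAML update: theta_i = theta - alpha * grad L_{T_i}(f_theta).
        loss_s = F.cross_entropy(model(x_s, model.params), y_s)
        grads = torch.autograd.grad(loss_s, model.params, create_graph=True)
        theta_i = [p - inner_lr * g for p, g in zip(model.params, grads)]

        # Entropy after the update, and the meta-loss under theta_i.
        h_theta_i = prediction_entropy(model(x_q, theta_i).softmax(dim=-1))
        loss_q = F.cross_entropy(model(x_q, theta_i), y_q)

        # L_{T_i}(f_{theta_i}) + lambda * (-H(theta) + H(theta_i))
        meta_loss = meta_loss + loss_q + lam * (-h_theta + h_theta_i)
    return meta_loss / len(tasks)
```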

Unfortunately, entropy-based TAML is subject to a critical limitation: it is only amenable to classification tasks, since computing the entropy requires discrete output labels.

Inequality-Minimization TAML

We wish to train a task-agnostic model in meta-learning such that its initial performance is unbiased towards any particular task $\mathcal{T}_i$. Such a task-agnostic meta-learner would do so by minimizing the inequality of its performance over different tasks.

Specifically, the bias of the initial model towards any particular task is minimized during meta-learning by minimizing the inequality over the losses of the tasks sampled in a batch. Given an unseen task during the testing phase, a better generalization performance is then expected on the new task, since the model is updated from an unbiased initial model with few examples. The key difference between the two TAML variants is that the entropy-based approach considers one task at a time, by computing the entropy of its output labels; moreover, the entropy depends on a particular form of the output function (e.g., a softmax distribution). By contrast, the inequality measure depends only on the losses, and is therefore more widely applicable. The algorithm learns to update the model parameter $\theta$ by minimizing the objective:

$$\min_\theta \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left[ \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i}) \right] + \lambda \, \mathcal{I}_{\mathcal{E}} \left( \{ \mathcal{L}_{\mathcal{T}_i}(f_\theta) \} \right)$$

It is worth noting that the inequality measure $\mathcal{I}_{\mathcal{E}}(\cdot)$ is computed over the set of losses from the sampled tasks. The first term is the expected loss of the model after the update, while the second is the inequality of the losses of the initial model before the update.
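
A sketch of this objective under the same assumed functional interface as above; `inequality_fn` can be any of the measures discussed in the next section (e.g., the Theil index implemented below).

```python
import torch
import torch.nn.functional as F

def taml_inequality_meta_loss(model, tasks, inequality_fn, inner_lr=0.01, lam=0.1):
    """Inequality-minimization TAML meta-objective for one batch of tasks (sketch).

    inequality_fn maps a 1-D tensor of per-task initial losses to a scalar
    inequality measure I_E.
    """
    query_losses, initial_losses = [], []
    for x_s, y_s, x_q, y_q in tasks:
        # Loss of the initial model theta on the task, before any update.
        loss_theta = F.cross_entropy(model(x_s, model.params), y_s)
        initial_losses.append(loss_theta)

        # Standard MAML inner update to theta_i.
        grads = torch.autograd.grad(loss_theta, model.params, create_graph=True)
        theta_i = [p - inner_lr * g for p, g in zip(model.params, grads)]
        query_losses.append(F.cross_entropy(model(x_q, theta_i), y_q))

    # E[L_{T_i}(f_{theta_i})] + lambda * I_E({L_{T_i}(f_theta)})
    return torch.stack(query_losses).mean() + lam * inequality_fn(torch.stack(initial_losses))
```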

Inequality measures

Inequality measures were originally developed to quantify economic inequality in outcomes such as wealth, income, or health-related metrics. In the meta-learning context, $\ell_i = \mathcal{L}_{\mathcal{T}_i}(f_\theta)$ represents the loss of task $\mathcal{T}_i$, $\bar{\ell}$ represents the mean of the losses over the sampled tasks, and $M$ is the number of tasks in a single batch. There are a few options of inequality measures that can be employed in our formulation (a code sketch of all of them follows this list):

  • Theil Index: This inequality measure is derived from redundancy in information theory, defined as the difference between the maximum entropy of the data and its observed entropy. Suppose that we have $M$ losses $\{\ell_i\}$; then the Theil index is defined as:

$$T_T = \frac{1}{M} \sum_{i=1}^{M} \frac{\ell_i}{\bar{\ell}} \ln \frac{\ell_i}{\bar{\ell}}$$

  • Generalized Entropy Index: The generalized entropy index has been proposed to measure income inequality. It is not a single inequality measure, but a family that includes several measures, such as the Theil index and Theil's L. When $\alpha$ is zero, it is the mean log deviation (Theil's L), and when $\alpha$ is one, it is the Theil index. A larger $\alpha$ makes the index more sensitive to differences at the upper part of the distribution, and a smaller $\alpha$ makes it more sensitive to differences at the bottom of the distribution. For a real value $\alpha \neq 0, 1$, it is defined as:

$$GE(\alpha) = \frac{1}{M\alpha(\alpha - 1)} \sum_{i=1}^{M} \left[ \left( \frac{\ell_i}{\bar{\ell}} \right)^{\alpha} - 1 \right]$$

  • Atkinson Index: This is another measure of income inequality, useful for determining which end of the distribution contributes the most to the observed inequality. Here, $\epsilon$ is called the “inequality aversion parameter”. When $\epsilon = 0$, the index is more sensitive to changes in the upper end of the distribution, and as $\epsilon$ approaches 1, it becomes more sensitive to changes in the lower end of the distribution. For $0 \le \epsilon \neq 1$, it is defined as:

$$A_\epsilon = 1 - \frac{1}{\bar{\ell}} \left( \frac{1}{M} \sum_{i=1}^{M} \ell_i^{1-\epsilon} \right)^{\frac{1}{1-\epsilon}}$$

  • Gini Coefficient: It is usually defined as half of the relative mean absolute difference. The Gini coefficient is more sensitive to deviations around the middle of the distribution than at the upper or lower parts. It can be written as:

$$G = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} |\ell_i - \ell_j|}{2 M^2 \bar{\ell}}$$

  • Variance of Logarithms: This metric places greater emphasis on the lower losses of the distribution. It is defined as:

$$V_L(\ell) = \frac{1}{M} \sum_{i=1}^{M} \left[ \ln \ell_i - \ln g(\ell) \right]^2$$

where $g(\ell)$ is the geometric mean of the distribution.
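
As promised above, here is a sketch of these measures over a 1-D tensor of per-task losses (assumed strictly positive, as cross-entropy losses typically are). The implementations use differentiable PyTorch operations so that any of them can serve as the regularizer $\mathcal{I}_{\mathcal{E}}$; the default parameter values are illustrative, not from the paper.

```python
import torch

def theil_index(losses: torch.Tensor) -> torch.Tensor:
    """T_T = (1/M) * sum_i (l_i / mean) * ln(l_i / mean)."""
    ratio = losses / losses.mean()
    return (ratio * ratio.log()).mean()

def generalized_entropy_index(losses: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """GE(alpha) for alpha outside {0, 1}."""
    ratio = losses / losses.mean()
    return (ratio.pow(alpha) - 1).mean() / (alpha * (alpha - 1))

def atkinson_index(losses: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """A_eps for 0 <= eps != 1."""
    # Equally-distributed equivalent of the losses.
    ede = losses.pow(1 - eps).mean().pow(1 / (1 - eps))
    return 1 - ede / losses.mean()

def gini_coefficient(losses: torch.Tensor) -> torch.Tensor:
    """Half of the relative mean absolute difference."""
    diffs = (losses.unsqueeze(0) - losses.unsqueeze(1)).abs()
    return diffs.sum() / (2 * losses.numel() ** 2 * losses.mean())

def variance_of_logarithms(losses: torch.Tensor) -> torch.Tensor:
    """V_L = (1/M) * sum_i (ln l_i - ln g(l))^2; ln g(l) is the mean of logs."""
    log_l = losses.log()
    return ((log_l - log_l.mean()) ** 2).mean()
```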

Conclusion

The authors demonstrate the performance of the proposed TAML model in several experiments, ranging from few-shot classification to reinforcement-learning problems.

Written on November 15, 2021