Meta-Learning for Batch Mode Active Learning

An overview of the paper “Meta-Learning for Batch Mode Active Learning”. In this paper, the authors propose a method to construct the best set of unlabeled items to label given a classifier trained on a small training set. All images and tables in this post are from their paper.

Problem Statement

The majority of popular approaches are based on heuristics such as choosing the item whose label the model is most uncertain about, choosing the item whose addition will make the model least uncertain about the remaining items, or choosing the item that is most “different” from the other unlabeled items according to some similarity function. However, these heuristics have several limitations when extended to the batch setting:

  • They can perform suboptimally and produce sets with overly redundant items.
  • The complexity of selecting each new item is at least quadratic, making them prohibitive to use for large unlabeled datasets.
  • They assume that unlabeled items belong to one of the classes we are interested in classifying; however, this is not always the case. The data will often contain distractor items that do not belong to any of these classes.
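
For concreteness, the simplest of these heuristics, entropy-based uncertainty sampling, can be extended to a batch by taking the top-$B$ most uncertain items. The sketch below is a generic illustration (not code from the paper) and also shows how a purely greedy choice ends up with redundant items:

```python
import numpy as np

def entropy_batch_selection(probs, batch_size):
    """Greedy uncertainty heuristic: take the items whose predicted class
    distribution has the highest entropy, ignoring redundancy between them."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:batch_size]

# Three near-identical uncertain items crowd out everything else:
probs = np.array([[0.51, 0.49],
                  [0.50, 0.50],
                  [0.52, 0.48],
                  [0.90, 0.10]])
print(entropy_batch_selection(probs, batch_size=2))  # selects two near-duplicates
```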

Method

The method involves supplementing the support set $\mathcal{S}$ and query set $\mathcal{Q}$ with an unlabeled set $\mathcal{U}$ of unlabeled examples. We consider $K$-shot, $N$-class, $B$-batch episodes where we need to select a subset $\mathcal{A} \subset \mathcal{U}$ of size $B$ to be labeled and added to our support set $\mathcal{S}$ to get a new support set $\mathcal{S}' = \mathcal{S} \cup \mathcal{A}$. The goal is to use the classifier formed from the original support set $\mathcal{S}$ to select the best subset of examples from $\mathcal{U}$ to label, creating the new support set $\mathcal{S}'$ and associated new classifier, so as to most improve the performance on the query set $\mathcal{Q}$.
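
To make the episode setup concrete, here is a minimal sketch assuming a Prototypical-Networks-style classifier over pre-computed embeddings (the prototypes referred to below); all names and shapes are illustrative rather than taken from the paper:

```python
import numpy as np

def compute_prototypes(support_embeddings, support_labels, num_classes):
    """One prototype per class: the mean embedding of that class's support items."""
    return np.stack([support_embeddings[support_labels == c].mean(axis=0)
                     for c in range(num_classes)])

def predict_probs(embeddings, prototypes):
    """Class probabilities via a softmax over negative squared distances to the prototypes."""
    dists = ((embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    logits = -dists - (-dists).max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# A toy 2-shot, 5-class episode: the initial classifier is built from the support set,
# and is then used (below) to pick B items out of the unlabeled set for labeling.
rng = np.random.default_rng(0)
support_x, support_y = rng.normal(size=(10, 8)), np.repeat(np.arange(5), 2)
prototypes = compute_prototypes(support_x, support_y, num_classes=5)
```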

We can calculate a set of statistics relating each unlabeled item $\tilde{x}_i$ to the set of prototypes $\{c_1, \dots, c_N\}$, and we denote this set of item-classifier statistics by $s_i$. These statistics are used to compute two distributions, quality and diversity, over which unlabeled item to add next to the existing subset $\mathcal{A}$.
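
The exact statistics are not reproduced in this overview; the sketch below assumes a plausible set, reusing the prototype helper above: the distance to each prototype, the induced class probabilities, the minimum distance, and the predictive entropy.

```python
import numpy as np

def item_classifier_statistics(embedding, prototypes):
    """A hypothetical s_i describing how one unlabeled item relates to the current classifier."""
    dists = np.sqrt(((prototypes - embedding) ** 2).sum(axis=1))  # distance to each prototype
    logits = -dists - (-dists).max()
    probs = np.exp(logits) / np.exp(logits).sum()                 # induced class probabilities
    entropy = -(probs * np.log(probs + 1e-12)).sum()              # predictive uncertainty
    return np.concatenate([dists, probs, [dists.min(), entropy]])
```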

  • Quality Distribution: The probability of selecting an unlabeled item according to its quality is defined as

$$p_{\text{quality}}(\tilde{x}_i) \propto \exp\big(f_\phi(s_i)\big),$$

where $f_\phi$ is an MLP with parameters $\phi$. This distribution scores each unlabeled item independently, based on a prediction of how useful the item will be to the existing classifier, according to a learned function of the item-classifier statistics.

  • Diversity Distribution: The same set of statistics can also be used to compute a feature vector describing the unlabeled-item-to-classifier relationship as

$$v_i = g_\psi(s_i),$$

where $g_\psi$ is an MLP with parameters $\psi$. The goal of the diversity distribution is to increase the probability of selecting unlabeled items which are dissimilar from the items that already make up the subset $\mathcal{A}$, where similarity is measured between each item's corresponding feature vectors. The probability of selecting an unlabeled item according to its diversity is then

$$p_{\text{diversity}}(\tilde{x}_i \mid \mathcal{A}) \propto \exp(\gamma \, d_i),$$

where $d_i = \min_{\tilde{x}_j \in \mathcal{A}} \sin \theta_{ij}$. Here, $\theta_{ij}$ is the angle between feature vectors $v_i$ and $v_j$, and $\gamma$ is a learned temperature parameter that allows us to control the flatness of this distribution. The probability of an item being picked increases as its feature vector is more orthogonal to the feature vectors of items already added to the subset $\mathcal{A}$.

  • Product of Experts: The final probability distribution over which unlabeled item to add next is obtained as a product-of-experts model combining the quality and diversity distributions, $p(\tilde{x}_i \mid \mathcal{A}) \propto p_{\text{quality}}(\tilde{x}_i) \cdot p_{\text{diversity}}(\tilde{x}_i \mid \mathcal{A})$ (a sketch of the resulting selection step follows this list).
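
Below is a minimal NumPy sketch of how one selection step could be assembled from these pieces. The two MLPs are stand-ins, the statistics are the hypothetical ones from `item_classifier_statistics` above, and the min-over-selected-items aggregation inside the diversity term is an assumption of this sketch rather than a detail taken from the paper.

```python
import numpy as np

def select_batch(stats, batch_size, quality_mlp, feature_mlp, gamma, rng):
    """Iteratively sample a batch from the product of the quality and diversity experts."""
    quality = np.array([quality_mlp(s) for s in stats], dtype=float)  # f_phi(s_i)
    features = np.stack([feature_mlp(s) for s in stats])              # v_i = g_psi(s_i)
    features = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)

    selected = []
    for _ in range(batch_size):
        if selected:
            # sin(theta_ij) peaks when v_i is orthogonal to an already-selected v_j;
            # aggregating with a min over the selected items is an assumption of this sketch.
            cos = np.clip(features @ features[selected].T, -1.0, 1.0)
            diversity = np.sqrt(1.0 - cos ** 2).min(axis=1)
        else:
            diversity = np.zeros(len(stats))
        logits = quality + gamma * diversity   # log of the product of experts
        logits[selected] = -np.inf             # never pick the same item twice
        probs = np.exp(logits - logits[np.isfinite(logits)].max())
        probs /= probs.sum()
        selected.append(int(rng.choice(len(stats), p=probs)))
    return selected

# Toy usage with random linear maps standing in for the two MLPs.
rng = np.random.default_rng(0)
stats = rng.normal(size=(20, 6))  # 20 unlabeled items, 6 statistics each
W_q, W_f = rng.normal(size=6), rng.normal(size=(6, 4))
print(select_batch(stats, batch_size=3,
                   quality_mlp=lambda s: s @ W_q,
                   feature_mlp=lambda s: s @ W_f,
                   gamma=5.0, rng=rng))
```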

Training

The model is trained with a loss that encourages the final classifier, built from the augmented support set, to achieve high accuracy on the query set, and all of the parameters described above are learned. Training proceeds in an episodic fashion, with new batches sampled according to the selection probabilities defined above.

Written on June 7, 2021