Representation Learning with Contrastive Predictive Coding

An overview of the paper “Representation Learning with Contrastive Predictive Coding”. The authors propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which they call Contrastive Predictive Coding. All images and tables in this post are from their paper.

The key insight of the model is to learn such representations by predicting the future in latent space with powerful autoregressive models. Despite the importance of unsupervised learning, it is yet to see a breakthrough similar to supervised learning: modeling high-level representations from raw observations remains elusive. Furthermore, it is not always clear what the ideal representation is, and whether it is possible to learn such a representation without additional supervision or specialization to a particular data modality.

One of the most common strategies for unsupervised learning has been to predict future, missing, or contextual information. This idea of predictive coding is an old one; recent unsupervised learning work has successfully used it to learn word representations by predicting neighboring words.

Contrastive Predictive Coding

Motivation and Intuitions

The main intuition behind their model is to learn representations that encode the underlying shared information between different parts of the (high-dimensional) signal, while discarding low-level information and noise that is more local. One of the challenges of predicting high-dimensional data is that unimodal losses such as mean-squared error and cross-entropy are not very useful, and powerful conditional generative models which need to reconstruct every detail in the data are usually required. But these models are computationally intense and waste capacity at modeling the complex relationships in the data $x$, often ignoring the context $c$. For example, images may contain thousands of bits of information, while high-level latent variables such as the class label contain much less information (10 bits for 1,024 categories). This suggests that modeling $p(x|c)$ directly may not be optimal for the purpose of extracting shared information between $x$ and $c$. When predicting future information, they instead encode the target $x$ (future) and context $c$ (present) into compact distributed vector representations (via non-linear learned mappings) in a way that maximally preserves the mutual information of the original signals $x$ and $c$, defined as

$$I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}$$

By maximizing the mutual information between the encoded representations, they extract the underlying latent variables the inputs have in common.
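To make the definition concrete, here is a tiny numeric sketch (not from the paper) that evaluates $I(x; c)$ for a hypothetical 2x2 joint distribution using the formula above:

```python
import numpy as np

# Toy joint distribution p(x, c) over two values of x and two values of c,
# purely to illustrate the mutual-information definition above.
p_xc = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xc.sum(axis=1, keepdims=True)   # marginal p(x)
p_c = p_xc.sum(axis=0, keepdims=True)   # marginal p(c)

# I(x; c) = sum_{x,c} p(x,c) * log( p(x|c) / p(x) )
p_x_given_c = p_xc / p_c
mi = np.sum(p_xc * np.log(p_x_given_c / p_x))
print(mi)  # ~0.19 nats for this particular joint
```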

Contrastive Predictive Coding

First, a non-linear encoder $g_{enc}$ maps the input sequence of observations $x_t$ to a sequence of latent representations $z_t = g_{enc}(x_t)$, potentially with a lower temporal resolution. Next, an autoregressive model $g_{ar}$ summarizes all $z_{\le t}$ in the latent space and produces a context latent representation $c_t = g_{ar}(z_{\le t})$. They do not predict future observations $x_{t+k}$ directly with a generative model $p_k(x_{t+k} \mid c_t)$. Instead, they model a density ratio which preserves the mutual information between $x_{t+k}$ and $c_t$ as follows:

$$f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}$$
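As an illustration, a minimal PyTorch sketch of this two-stage setup might look like the following; the layer sizes and the specific choice of a strided convolutional encoder with a GRU summarizer are assumptions for the sketch, not a reproduction of the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CPCEncoder(nn.Module):
    """Sketch: g_enc maps observations x_t to latents z_t, and g_ar (a GRU
    here) summarizes z_<=t into a context representation c_t."""
    def __init__(self, in_dim=1, z_dim=256, c_dim=256):
        super().__init__()
        # g_enc: strided 1-D convolutions give a lower temporal resolution
        self.g_enc = nn.Sequential(
            nn.Conv1d(in_dim, z_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )
        # g_ar: autoregressive summary of all z_<=t
        self.g_ar = nn.GRU(z_dim, c_dim, batch_first=True)

    def forward(self, x):            # x: (batch, in_dim, time)
        z = self.g_enc(x)            # (batch, z_dim, time/4)
        z = z.transpose(1, 2)        # (batch, time/4, z_dim)
        c, _ = self.g_ar(z)          # (batch, time/4, c_dim)
        return z, c

# Usage: a batch of 8 raw sequences of length 64
z, c = CPCEncoder()(torch.randn(8, 1, 64))   # z: (8, 16, 256), c: (8, 16, 256)
```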

They can use a simple log-bilinear model:

$$f_k(x_{t+k}, c_t) = \exp\left(z_{t+k}^{\top} W_k c_t\right)$$
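A hedged sketch of this score, with one $W_k$ per prediction step implemented as a bias-free linear layer (the class and argument names are illustrative, not from the paper). It returns $z_{t+k}^{\top} W_k c_t$, i.e. $\log f_k$; the exponential is absorbed into the softmax of the loss below:

```python
import torch
import torch.nn as nn

class LogBilinearScore(nn.Module):
    """Sketch of the log-bilinear density-ratio model with one W_k per step."""
    def __init__(self, z_dim, c_dim, max_steps):
        super().__init__()
        # W_k implemented as a linear map from the context space into z-space
        self.W = nn.ModuleList([nn.Linear(c_dim, z_dim, bias=False)
                                for _ in range(max_steps)])

    def forward(self, z_future, c_t, k):
        # z_future: (batch, z_dim), c_t: (batch, c_dim)
        pred = self.W[k](c_t)                 # W_k c_t
        return (z_future * pred).sum(dim=-1)  # z_{t+k}^T W_k c_t = log f_k

# Usage: score the k=3 step for a batch of 8 (z_{t+k}, c_t) pairs
score_fn = LogBilinearScore(z_dim=256, c_dim=256, max_steps=12)
s = score_fn(torch.randn(8, 256), torch.randn(8, 256), k=3)  # shape (8,)
```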

By using a density ratio $f_k(x_{t+k}, c_t)$ and inferring $z_{t+k}$ with an encoder, they relieve the model from modeling the high-dimensional distribution $x_{t+k}$.

Figure 1: Overview of Contrastive Predictive Coding.

InfoNCE Loss and Mutual Information Estimation

Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE, which they call InfoNCE. Given a set $X = \{x_1, \dots, x_N\}$ of $N$ random samples containing one positive sample from $p(x_{t+k} \mid c_t)$ and $N-1$ negative samples from the 'proposal' distribution $p(x_{t+k})$, they optimize:

$$\mathcal{L}_N = -\mathbb{E}_X \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right]$$

Optimizing the above function will result in $f_k(x_{t+k}, c_t)$ estimating the density ratio. Furthermore, the authors showed that $I(x_{t+k}, c_t) \ge \log(N) - \mathcal{L}_N$, which becomes tighter as $N$ becomes larger.
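In practice this loss is a categorical cross-entropy over the scores, with the positive sample as the correct class. A minimal sketch under that assumption (the score layout and function name are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(scores):
    """scores: (batch, N) matrix of log f_k values, one row per context c_t,
    with the single positive in column 0 and the N-1 negatives after it.
    Cross-entropy with target 0 equals -log( f_pos / sum_j f_j )."""
    targets = torch.zeros(scores.size(0), dtype=torch.long)
    return F.cross_entropy(scores, targets)

# Usage: N = 8 samples per row, so the bound gives I >= log(8) - loss
loss = info_nce_loss(torch.randn(32, 8))
```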

Written on March 22, 2021