Semi-supervised classification with graph convolutional networks

An overview of the paper “Semi-supervised classification with graph convolutional networks” by Kipf and Welling. The authors present a scalable approach for semi-supervised learning on graph-structured data, based on an efficient variant of convolutional neural networks (CNNs) that operates directly on graphs. All images and tables in this post are from their paper.

Introduction

Here, the authors consider the problem of classifying nodes in a graph, where labels are only available for a small subset of nodes. This problem can be framed as graph-based semi-supervised learning, where label information is smoothed over the graph via some form of explicit graph-based regularization, e.g. by using a graph Laplacian regularization term in the loss function:

$$\mathcal{L} = \mathcal{L}_0 + \lambda \mathcal{L}_{\text{reg}}, \qquad \mathcal{L}_{\text{reg}} = \sum_{i,j} A_{ij} \,\lVert f(X_i) - f(X_j) \rVert^2 = f(X)^\top \Delta f(X)$$

Here, $\mathcal{L}_0$ denotes the supervised loss w.r.t. the labeled part of the graph, $f(\cdot)$ can be a neural-network-like differentiable function, $\lambda$ is a weighing factor and $X$ is a matrix of node feature vectors $X_i$. $\Delta = D - A$ denotes the unnormalized graph Laplacian of an undirected graph with adjacency matrix $A$ and degree matrix $D_{ii} = \sum_j A_{ij}$. The above formulation relies on the assumption that connected nodes in the graph are likely to share the same label. This assumption, however, might restrict model capacity, as graph edges need not necessarily encode node similarity, but could contain additional information.
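To make the regularization term concrete, here is a minimal NumPy sketch that evaluates the quadratic form $f(X)^\top \Delta f(X)$ on a toy path graph. The adjacency matrix and the outputs `f_X` are made-up illustrative values, not anything from the paper:

```python
import numpy as np

# Adjacency matrix of a toy 3-node path graph (nodes 0-1-2).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = np.diag(A.sum(axis=1))   # degree matrix
Delta = D - A                # unnormalized graph Laplacian

# Hypothetical outputs f(X) of some differentiable function, one value per node.
f_X = np.array([[0.2], [0.3], [0.9]])

# Quadratic form f(X)^T Delta f(X): large when connected nodes get different outputs.
L_reg = float(f_X.T @ Delta @ f_X)
print(L_reg)  # 0.37 for these toy values
```

Connected nodes with very different outputs (here, nodes 1 and 2) dominate the penalty, which is exactly how label information gets smoothed over the graph.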

Fast approximate convolutions on graphs

Here, we consider a multi-layer Graph Convolutional Network (GCN) with the following layer-wise propagation rule:

$$H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$

Here, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph with added self-connections (self-loops); this is done so that each node includes its own features in its next representation. $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is the degree matrix of $\tilde{A}$, which is used to normalize nodes with high degrees. $W^{(l)}$ is a layer-specific trainable weight matrix and $\sigma(\cdot)$ denotes an activation function such as ReLU. $H^{(l)}$ denotes the matrix of activations in the $l$-th layer, such that $H^{(0)} = X$.
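As a sanity check of the propagation rule, here is a small NumPy sketch of a single GCN layer. The ReLU activation, the toy 3-node graph, and the random weight matrix are illustrative assumptions rather than the paper's reference implementation:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D_tilde^{-1/2} A_tilde D_tilde^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])              # add self-connections
    d_tilde = A_tilde.sum(axis=1)                 # node degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D_tilde^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetrically normalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)         # point-wise ReLU non-linearity

# Toy example: 3 nodes, 2 input features, 4 hidden units.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.random.randn(3, 2)    # H^(0) = X
W0 = np.random.randn(2, 4)   # trainable weight matrix of the first layer
H1 = gcn_layer(A, X, W0)     # next-layer activations, shape (3, 4)
```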

Spectral graph convolutions

Here, we consider spectral convolutions on graphs defined as the multiplication of a signal $x \in \mathbb{R}^N$ (a scalar for every node) with a filter $g_\theta = \mathrm{diag}(\theta)$ parametrized by $\theta \in \mathbb{R}^N$ in the Fourier domain, i.e.:

$$g_\theta \star x = U g_\theta U^\top x$$

Here, $U$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} = U \Lambda U^\top$, with $\Lambda$ a diagonal matrix of its eigenvalues, and $U^\top x$ is the graph Fourier transform of $x$. We can understand $g_\theta$ as a function $g_\theta(\Lambda)$ of the eigenvalues of $L$. However, evaluating this equation is computationally expensive: multiplication with the eigenvector matrix $U$ is $\mathcal{O}(N^2)$, and computing the eigendecomposition of $L$ in the first place might be prohibitive for large graphs. To circumvent this problem, it was suggested that $g_\theta(\Lambda)$ can be approximated by a truncated expansion in terms of Chebyshev polynomials $T_k(x)$ up to order $K$:

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k \, T_k(\tilde{\Lambda}), \qquad \tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I_N$$

This approximation depends only on nodes that are at most $K$ steps away from the central node ($K$-th order neighborhood).
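Below is a minimal NumPy sketch of such a truncated Chebyshev filter, using the standard three-term recurrence. The function name and the toy usage are my own; it assumes at least two coefficients (i.e. $K \geq 1$) and computes $\lambda_{\max}$ exactly, whereas in practice it would typically be approximated:

```python
import numpy as np

def chebyshev_filter(L, x, theta):
    """Approximate g_theta * x by a Chebyshev expansion of order K = len(theta) - 1."""
    N = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()        # largest eigenvalue of L
    L_tilde = (2.0 / lam_max) * L - np.eye(N)    # rescaled Laplacian, eigenvalues in [-1, 1]
    Tx_prev, Tx_curr = x, L_tilde @ x            # T_0(L~) x = x,  T_1(L~) x = L~ x
    out = theta[0] * Tx_prev + theta[1] * Tx_curr
    for k in range(2, len(theta)):
        # Chebyshev recurrence: T_k(L~) x = 2 L~ T_{k-1}(L~) x - T_{k-2}(L~) x
        Tx_prev, Tx_curr = Tx_curr, 2.0 * (L_tilde @ Tx_curr) - Tx_prev
        out = out + theta[k] * Tx_curr
    return out

# Example: K = 2 filter on a toy 3-node path graph.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt      # normalized graph Laplacian
y = chebyshev_filter(L, x=np.array([1.0, 0.0, -1.0]), theta=np.array([0.5, 0.3, 0.2]))
```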

Layer-wise linear model

A neural network model based on graph convolutions can therefore be built by stacking multiple convolutional layers of the form described above, each layer followed by a point-wise non-linearity. The idea is that we can recover a rich class of convolutional filter functions by stacking such layers. Intuitively, such networks can alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions.

In this linear formulation, we only consider immediate neighbors of each node. Successive application of filters of this form then effectively convolves the $k$-th order neighborhood of a node, where $k$ is the number of successive filtering operations or convolutional layers in the neural network model. To further cut down on the number of parameters, we use a single parameter $\theta = \theta'_0 = -\theta'_1$, such that the filter is approximated by:

$$g_\theta \star x \approx \theta \left( I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \right) x$$

However, the operator $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ now has eigenvalues in the range $[0, 2]$. Repeated application of this operator can therefore lead to numerical instabilities and exploding/vanishing gradients when used in a deep learning model. To alleviate this problem, they introduced the renormalization trick $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, whose components $\tilde{A}$ and $\tilde{D}$ have been discussed earlier. We finally obtain

$$Z = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta$$

which is intuitively similar to computing $X\Theta$, a linear transformation of every node's features, and then propagating these outputs across all nodes using the normalized adjacency matrix.
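A small NumPy check, on an assumed toy path graph, of why the renormalization trick helps: it compares the spectrum of $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ with that of $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$.

```python
import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
N = A.shape[0]

# Before the renormalization trick: I_N + D^-1/2 A D^-1/2, eigenvalues in [0, 2].
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
op_before = np.eye(N) + D_inv_sqrt @ A @ D_inv_sqrt

# After the renormalization trick: D~^-1/2 A~ D~^-1/2 with A~ = A + I_N.
A_tilde = A + np.eye(N)
D_tilde_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
op_after = D_tilde_inv_sqrt @ A_tilde @ D_tilde_inv_sqrt

print(np.linalg.eigvalsh(op_before))  # reaches 2.0, so repeated application can blow up
print(np.linalg.eigvalsh(op_after))   # bounded by 1.0, better behaved when layers are stacked
```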

Graph Convolutional Network

Figure 1

Now, all that’s left is to define a loss function, the cross-entropy error evaluated over all labeled examples, and to update the weight matrices using backpropagation and gradient descent.
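For concreteness, here is a minimal NumPy sketch of that semi-supervised loss: cross-entropy computed only over the labeled nodes. The function name and shapes are my own, the loss is averaged rather than summed over labeled nodes, and the gradient computation and optimizer are omitted:

```python
import numpy as np

def masked_cross_entropy(Z, Y, labeled_idx):
    """Cross-entropy over the labeled nodes only.

    Z:           (N, F) final-layer outputs (pre-softmax scores) for all N nodes.
    Y:           (N, F) one-hot labels; rows of unlabeled nodes are simply ignored.
    labeled_idx: indices of the labeled nodes.
    """
    scores = Z[labeled_idx]
    scores = scores - scores.max(axis=1, keepdims=True)                     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))  # log-softmax
    return -np.mean(np.sum(Y[labeled_idx] * log_probs, axis=1))
```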

Hidden layer activations using t-SNE

Figure 2

Written on November 22, 2020