Learning Representations by back-propagating errors

An overview of the paper “Learning Representations by back-propagating errors”. The paper proposes a learning procedure for networks of neuron-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal “hidden” units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. All images and tables in this post are from the paper.

Network

The total input, $x_j$, to unit $j$ is a linear function of the outputs, $y_i$, of the units that are connected to $j$ and of the weights, $w_{ji}$, on these connections:

$$x_j = \sum_i y_i w_{ji}$$

Units can be given biases by introducing an extra input to each unit which always has a value of 1. The weight on this extra input is called the bias and is equivalent to a threshold of the opposite sign. Let us use a simple error function such as the mean squared error,

$$E = \frac{1}{2} \sum_c \sum_j (y_{j,c} - d_{j,c})^2,$$

where $c$ indexes the input-output cases, $j$ indexes the output units, $y$ is the actual state of an output unit and $d$ is its desired state. To minimize $E$ by gradient descent, it is necessary to compute the partial derivative of $E$ with respect to each weight in the network. For a given case, the partial derivatives of the error with respect to each weight are computed in two passes. The forward pass computes the output of the network. The backward pass is more complicated. We first compute the gradient with respect to the output of a unit, $\partial E / \partial y_j = y_j - d_j$. Then, by the chain rule, we compute the derivative of $E$ with respect to the total input $x_j$ as follows:

$$\frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \cdot \frac{dy_j}{dx_j}$$
If $y_j$ is a sigmoid function of the total input, $y_j = \frac{1}{1 + e^{-x_j}}$, then $\frac{dy_j}{dx_j} = y_j(1 - y_j)$, so $\frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j}\, y_j (1 - y_j)$. This means that we know how a change in the total input to an output unit will affect the error. But this total input is just a linear function of the states of the lower-level units and of the weights on the connections, so it is easy to compute how the error will be affected by changing these states and weights. For a weight $w_{ji}$ from unit $i$ to unit $j$, the derivative is $\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_j}\, y_i$.
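
As a concrete illustration, below is a minimal NumPy sketch (not from the paper; the values and variable names are assumptions made for the example) of these output-layer derivatives for a single sigmoid unit, with the bias handled as an extra input that is always 1:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Outputs of the lower-level units feeding into output unit j.
# The last entry is the constant 1 whose weight acts as the bias.
y_prev = np.array([0.2, 0.7, 1.0])
w = np.array([0.5, -0.3, 0.1])

x_j = y_prev @ w                     # total input: x_j = sum_i y_i * w_ji
y_j = sigmoid(x_j)                   # output of unit j
d_j = 1.0                            # desired output for this case

E = 0.5 * (y_j - d_j) ** 2           # squared-error contribution of this unit

dE_dy = y_j - d_j                    # dE/dy_j
dE_dx = dE_dy * y_j * (1.0 - y_j)    # dE/dx_j via the sigmoid derivative
dE_dw = dE_dx * y_prev               # dE/dw_ji = dE/dx_j * y_i

print(E, dE_dw)
```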

To compute $\partial E / \partial y_i$ for a unit $i$ in an earlier layer, we sum the effect of $y_i$ on the error through every unit $j$ that it feeds into:

$$\frac{\partial E}{\partial y_i} = \sum_j \frac{\partial E}{\partial x_j}\, w_{ji}$$
Hence, $\partial E / \partial y_i$ is just the sum, over all the units $j$ that unit $i$ connects to, of $\frac{\partial E}{\partial x_j} w_{ji}$. Repeating this computation layer by layer yields the error derivative for every weight in the network, and this recursion is the basis of the backpropagation algorithm used in neural networks.
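
Putting the two passes together, here is a minimal, self-contained sketch of the procedure for a network with one hidden layer of sigmoid units, trained by gradient descent on a single case. The network size, learning rate, and training data are illustrative assumptions, and the bias inputs are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Input -> hidden -> output network with sigmoid units throughout.
n_in, n_hidden, n_out = 3, 4, 2
W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))   # W1[i, j] = w_ji
W2 = rng.normal(scale=0.5, size=(n_hidden, n_out))

y0 = np.array([0.1, 0.9, 0.4])   # input vector (states of the input units)
d = np.array([1.0, 0.0])         # desired output vector
lr = 0.5                         # learning rate (arbitrary choice)

for step in range(1000):
    # Forward pass: x_j = sum_i y_i w_ji, y_j = sigmoid(x_j)
    x1 = y0 @ W1
    y1 = sigmoid(x1)
    x2 = y1 @ W2
    y2 = sigmoid(x2)

    # Backward pass, applying the derivatives above layer by layer.
    dE_dy2 = y2 - d                     # dE/dy_j at the output layer
    dE_dx2 = dE_dy2 * y2 * (1 - y2)     # dE/dx_j = dE/dy_j * y_j(1 - y_j)
    dE_dW2 = np.outer(y1, dE_dx2)       # dE/dw_ji = dE/dx_j * y_i

    dE_dy1 = W2 @ dE_dx2                # dE/dy_i = sum_j dE/dx_j * w_ji
    dE_dx1 = dE_dy1 * y1 * (1 - y1)
    dE_dW1 = np.outer(y0, dE_dx1)

    # Gradient-descent weight update.
    W2 -= lr * dE_dW2
    W1 -= lr * dE_dW1

print(y2)   # after training, close to the desired vector d
```

Each backward-pass line corresponds to one of the derivatives derived above; batching over cases, momentum, and bias inputs can be layered on top of this skeleton.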

Written on August 10, 2020