A Gentle Intro to Backpropagation in Neural Networks

AND/OR Gate Perceptron Approximator

Let's illustrate the math behind the popular backpropagation algorithm used for training neural nets by creating a one-layer neural network (a 'perceptron') that can approximate the logical AND and OR gates.

And Gate Truth Table:

| x0 | x1 | y |
|----|----|---|
| 0  | 0  | 0 |
| 1  | 0  | 0 |
| 0  | 1  | 0 |
| 1  | 1  | 1 |

We'll model this as a regression problem: our neural network will predict one float value. If this value is above .5, we'll say the network predicted a 1, else it predicted a 0.
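The truth table and the 0.5 thresholding rule can be sketched in NumPy (the variable and function names here are my own, not from the original):

```python
import numpy as np

# Each row of X is one (x0, x1) input pair from the truth table.
X = np.array([[0, 0],
              [1, 0],
              [0, 1],
              [1, 1]], dtype=float)

# AND-gate targets, in the same row order as X.
y_and = np.array([0, 0, 0, 1], dtype=float)

def to_class(raw):
    """Threshold raw float predictions at 0.5 to get 0/1 class labels."""
    return (np.asarray(raw) > 0.5).astype(int)
```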

Our neural network will be a one-layer neural network with a sigmoid activation function:

$$output = \sigma(\bf{W^T}*\bf{X} + \bf{b})$$

The sigmoid function is: $$\sigma(x) = \frac{1}{1+e^{-x}}$$

$\bf{W}$ is a 2x1 matrix
$\bf{b}$ is a 1x1 scalar
Therefore, the output should be a scalar: $[2x1]^T * [2x1] + [1x1] = [1x1]$
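Following those shapes, the forward pass can be sketched directly (`sigmoid` and `forward` are hypothetical helper names, not from the original):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W, b, x):
    # W: (2, 1) weight matrix, x: (2, 1) input column, b: scalar bias.
    # W.T @ x has shape (1, 1), so the final output is effectively a scalar.
    return sigmoid(float(W.T @ x) + b)
```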

Because this is a regression problem, let's use mean-squared-error as the error function.

$$E = error(\hat{y}, y) = (\hat{y} - y)^2$$
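For a single example, this error is just:

```python
def mse(y_hat, y):
    # Squared error for one (prediction, target) pair.
    return (y_hat - y) ** 2
```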

Gradient Descent

The algorithm by which the network's weights and biases are updated is called backpropagation.

  1. The partial derivative of the error (the gradient) with respect to the weights/biases will tell us which direction to shift the weights/biases.
  2. Because the network is really just a nested function, you can use the chain rule to calculate this partial derivative. This is the equation for the whole network: $$output = \frac{1}{1+e^{-1*(\bf{W^T*X+b})}}$$ If we include the error calculation: $$E = (\frac{1}{1+e^{-1*(\bf{W^T*X+b})}} - y)^2$$
  3. If we start taking the partial derivative of the error with respect to the weights, we'll get these terms via the chain rule: $$\frac{\partial E}{\partial \bf{W}} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \sigma}{\partial out_1} \frac{\partial out_1}{\partial \bf{W}}$$

    ( Remember that $\hat{y} = \sigma(out_1)$, and $out_1 = \bf{W^T}*\bf{X} + \bf{b}$ )
    Now, we can calculate the separate parts:
    $\frac{\partial E}{\partial \hat{y}} = 2 * (\hat{y} - y)\\ \frac{\partial \sigma}{\partial out_1} = \sigma(out_1) * (1 - \sigma(out_1))\\ \frac{\partial out_1}{\partial \bf{W}} = \bf{X}$
    Putting the parts together: $$\frac{\partial E}{\partial \bf{W}} = [2 * (\hat{y} - y)] * [\sigma(out_1) * (1 - \sigma(out_1))] * [\bf{X}]$$

  4. Similarly, calculating the gradient for the biases: $$\frac{\partial E}{\partial \bf{b}} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \sigma}{\partial out_1} \frac{\partial out_1}{\partial \bf{b}}$$    Each of these terms is: $\frac{\partial E}{\partial \hat{y}} = 2 * (\hat{y} - y)\\ \frac{\partial \sigma}{\partial out_1} = \sigma(out_1) * (1 - \sigma(out_1))\\ \frac{\partial out_1}{\partial \bf{b}} = 1$    Putting it together: $$\frac{\partial E}{\partial \bf{b}} = [2 * (\hat{y} - y)] * [\sigma(out_1) * (1 - \sigma(out_1))] * [1]$$

  5. Updating the weights and biases involves scaling the gradient by the error and a scalar called the learning rate (written $\alpha$ here):
$$\bf{W} = \bf{W} - \alpha * error * \frac{\partial E}{\partial \bf{W}}$$
$$\bf{b} = \bf{b} - \alpha * error * \frac{\partial E}{\partial \bf{b}}$$
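Putting steps 1–5 together, a minimal training loop might look like the following sketch. The function and variable names are my own, and the update step includes the extra error scaling exactly as written in the equations above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, y, lr=1.0, epochs=10000, seed=0):
    """Train a sigmoid perceptron with stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(2, 1))  # (2, 1) weight matrix
    b = 0.0                      # scalar bias
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            x_i = x_i.reshape(2, 1)
            out1 = float(W.T @ x_i + b)      # pre-activation W^T X + b
            y_hat = sigmoid(out1)            # network prediction
            err = (y_hat - y_i) ** 2         # squared error E
            # Chain-rule factors from the derivation above:
            dE_dyhat = 2.0 * (y_hat - y_i)
            dsig = sigmoid(out1) * (1.0 - sigmoid(out1))
            grad_W = dE_dyhat * dsig * x_i   # dE/dW
            grad_b = dE_dyhat * dsig * 1.0   # dE/db
            # Update rule as given in the text (note the extra error factor):
            W = W - lr * err * grad_W
            b = b - lr * err * grad_b
    return W, b
```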

Let's generate and train some weights

We're really overfitting on the data, but since four examples are all we'll ever see, overfitting isn't really a concern in this case.

Let's train and test our perceptron on an AND Gate
Let's try an OR Gate

As you can see, the perceptron has done a great job predicting the outputs of AND and OR gates.

You'll notice that the more epochs you train for, the more the network overfits, and the gap between the raw prediction values for the two classes correspondingly widens.

Limitations of Perceptrons

No matter how long you train for, however, you can't accurately predict an XOR gate's outputs, because XOR is not linearly separable: no single linear boundary separates its 1s from its 0s. You can see that the raw predictions for all inputs end up nearly identical (very close to .5).

Multi-layer perceptrons (MLPs), however, CAN be made to predict accurately on non-linearly-separable data, because they have more than one layer. Backpropagation for MLPs works the same way as for single-layer perceptrons; there are simply more terms and partial derivatives to chain together.
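As a quick sanity check of that claim, here is a two-layer network with hand-picked (not trained) weights that computes XOR; the specific weight values are my own construction, chosen to make one hidden unit act like OR and the other like AND:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def xor_mlp(x0, x1):
    # Hidden layer: h1 approximates OR, h2 approximates AND.
    # Large weights push the sigmoids close to 0 or 1.
    h1 = sigmoid(20.0 * (x0 + x1) - 10.0)   # OR-like unit
    h2 = sigmoid(20.0 * (x0 + x1) - 30.0)   # AND-like unit
    # Output fires when OR is on but AND is off -- i.e., XOR.
    return sigmoid(20.0 * h1 - 40.0 * h2 - 10.0)
```

The hidden layer gives the network a bent decision boundary, which is exactly what a single linear layer cannot produce.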