Let's illustrate the math behind the popular backpropagation algorithm used for training neural nets by creating a one-layer neural network (a 'perceptron') that can approximate the logical AND and OR gates.
And Gate Truth Table:
x0 | x1 | y
0 | 0 | 0
1 | 0 | 0
0 | 1 | 0
1 | 1 | 1
We'll model this as a regression problem: our neural network will predict one float value. If this value is above .5, we'll say the network predicted a 1, else it predicted a 0.
Our neural network will be a one-layer neural network with a sigmoid activation function:
$$output = \sigma(\bf{W^T}*\bf{X} + \bf{b})$$The sigmoid function is: $$\sigma = 1 / (1+e^{-x})$$
$\bf{W}$ is a 2x1 matrix
$\bf{b}$ is a 1x1 scalar
Therefore, the output should be a scalar: $[2x1]^T * [2x1] + [1x1] = [1x1]$
import numpy as np
def sigmoid(x):
return 1 / (1 + np.exp(-1 * x))
def forward(weights, bias, x):
x = x + 1e-10
out = sigmoid(np.dot(weights.T, x) + bias)[0][0]
return out, int(out > 0.5)
Because this is a regression problem, let's use mean-squared-error as the error function.
$$E = error(\hat{y}, y) = (\hat{y} - y)^2$$The algoithm via which network weights and biases are modified is called backpropagation.
If we start taking the partial derivative of the error with respect to the weights, we'll get these terms via the chain rule: $$\frac{\delta E}{\delta \bf{W}} = \frac{\delta E}{\delta \hat{y}} \frac{\delta \sigma}{\delta out_1} \frac{\delta out_1}{\delta \bf{W}}$$
Similarly, calculating the gradient for the biases: $$\frac{\delta E}{\delta \bf{b}} = \frac{\delta E}{\delta \hat{y}} \frac{\delta \sigma}{\delta out_1} \frac{\delta out_1}{\delta \bf{b}}$$ Each of these terms is: $\frac{\delta E}{\delta \hat{y}} = 2 * (\hat{y} - y)\\ \frac{\delta \sigma}{\delta out_1} = \sigma(out_1) * (1 - \sigma(out_1))\\ \frac{\delta \sigma}{\delta \bf{b}} = 1$ Putting it together: $$\frac{\delta E}{\delta \bf{b}} = [2 * (\hat{y} - y)] * [\sigma(out_1) * (1 - \sigma(out_1))] * [1]$$
def backpropagate(x, y, weights, bias, learning_rate):
x = x + 1e-10
y = y + 1e-10
# caculate prediction
out_1 = np.dot(weights.T, x) + bias
out_2 = sigmoid(out_1)
# calculate error
error_1 = out_2 - y
error_2 = np.square(error_1)
# Gradient Descent Step
dE_dy = 2 * error_1
dsigma_dout1 = out_2 * (1 - out_2)
# calculate gradient of error w.r.t weights
dE_dW = dE_dy * dsigma_dout1 * x
# calculate new weights
new_w = weights - learning_rate * error_2 * dE_dW
# calculate gradient of error w.r.t bias
dE_db = dE_dy * dsigma_dout1 * 1
# calculate new biases
new_b = bias - learning_rate * error_2 * dE_db
return new_w, new_b
We're really overfitting on the data, but since four examples are all we'll ever see, overfitting isn't really a concern in this case.
def train(data, epochs=150):
weights_shape, bias_shape = (2,1), (1,)
weights = np.zeros(weights_shape) + 1e-2
bias = np.zeros(bias_shape)
learning_rate = .1
for _ in range(epochs):
for batch in data:
x = np.array(batch[:2])[:, None]
y = np.array(batch[ 2])
weights, bias = backpropagate(x, y,
weights,
bias,
learning_rate)
return weights, bias
and_gate_data = [(0,0,0), (0,1,0), (1,0,0), (1,1,1)]
print("AND Gate:")
print("\ta\tb\toutput\n{}".format('\t' + '\n\t'.join(['\t'.join([str(j) for j in i]) for i in and_gate_data])))
print("\nWeights and Biases:")
weights, bias = train(and_gate_data)
print(f'\tWeights: {weights[:,0]}')
print(f'\tBias: {bias[0]}')
print("\nPredictions:")
for (a, b, c) in and_gate_data:
raw_pred, pred = forward(weights, bias, np.array([[a],[b]]))
print(f'\tInput: {[a,b]}\t'
f'\tTruth: {c}\t'
f'\tPrediction: {round(raw_pred,6)} => {pred}')
AND Gate: a b output 0 0 0 0 1 0 1 0 0 1 1 1 Weights and Biases: Weights: [0.39349878 0.39069549] Bias: [-0.71500872] Predictions: Input: [0, 0] Truth: 0 Prediction: 0.328493 => 0 Input: [0, 1] Truth: 0 Prediction: 0.419625 => 0 Input: [1, 0] Truth: 0 Prediction: 0.420308 => 0 Input: [1, 1] Truth: 1 Prediction: 0.517289 => 1
or_gate_data = [(0,0,0), (0,1,1), (1,0,1), (1,1,1)]
print("AND Gate:")
print("\ta\tb\toutput\n{}".format('\t' + '\n\t'.join(['\t'.join([str(j) for j in i]) for i in or_gate_data])))
print("\nWeights and Biases:")
weights, bias = train(or_gate_data)
print(f'\tWeights: {weights[:,0]}')
print(f'\tBias: {bias[0]}')
print("\nPredictions:")
for (a, b, c) in or_gate_data:
raw_pred, pred = forward(weights, bias, np.array([[a],[b]]))
print(f'\tInput: {[a,b]}\t'
f'\tTruth: {c}\t'
f'\tPrediction: {round(raw_pred,6)} => {pred}')
AND Gate: a b output 0 0 0 0 1 1 1 0 1 1 1 1 Weights and Biases: Weights: [0.62831802 0.62989494] Bias: [-0.02036241] Predictions: Input: [0, 0] Truth: 0 Prediction: 0.49491 => 0 Input: [0, 1] Truth: 1 Prediction: 0.647834 => 1 Input: [1, 0] Truth: 1 Prediction: 0.647474 => 1 Input: [1, 1] Truth: 1 Prediction: 0.77519 => 1
As you can see, the perceptron has done a great job predicting the outputs of AND and OR gates.
You'll notice that the more epochs you train for, the more you'll overfit and the divergence in the raw predictions values between classes will correspondingly increase.
xor_gate_data = [(0,0,0), (0,1,1), (1,0,1), (1,1,0)]
weights, bias = train(xor_gate_data, epochs=50000)
print(f'Weights: {weights[:,0]}')
print(f'Bias: {bias[0]}\n')
for (a, b, c) in and_gate_data:
raw_pred, pred = forward(weights, bias, np.array([[a],[b]]))
print(f'Input: {[a,b]}\t'
f'Truth: {c}\t'
f'Pred: {round(raw_pred,6)} => {pred}')
Weights: [-0.01261855 -0.00630927] Bias: [0.00630927] Input: [0, 0] Truth: 0 Pred: 0.501577 => 1 Input: [0, 1] Truth: 0 Pred: 0.5 => 1 Input: [1, 0] Truth: 0 Pred: 0.498423 => 0 Input: [1, 1] Truth: 1 Pred: 0.496845 => 0
No matter how long you train for, however, you can't accurately predict an XOR gate's outputs because its outputs are non-linear w.r.t. the inputs. You can see the raw predictions for all inputs are just about the same (very close to .5).
Multi-layer perceptrons, however, CAN be made to accurately predict on non-linear data, because they have more than one layer. Backpropagation for MLPs is similar to single-layer perceptrons; you just have more terms and partial derivatives to deal with.