Here you just find a short summary (math and implementation with python) of the forward and backward pass of feedforward neural networks.
If you a new to neural networks first read the first two chapters of the online book Neural Network and Deep Learning or the first chapters from a neural networks text book (see the literature list below). For a less dense derivation of backpropagation and a tutorial implementation of neural networks with python see the Hacker's guide to Neural Networks.
The activities of the layers can be written in vector form:
with $1 \leq i \leq n^{(l)}$ with $n^{(l)}$ neurons in the layer $l$. The number of units $n^{(l)}$ can be different for each layer.
Each layer $l$ (except the output layer $l=L$) has a bias unit $a_0^{(l)} = 1$.
The activities of the neurons -except for the input layer- are determined by the activity vector of the layer below.
For a full connected neural network for each unit $i$ the weighted sum of the activities of the layer below is:
$$ z_i^{(l+1)} = \sum_{j=1}^{n} w_{ij}^{(l)} a_j^{(l)} + b_{j} $$In vector-matrix form (affine transformation):
$$ \vec z^{(l+1)} = W^{(l)} \cdot \vec a^{(l)} + \vec b^{(l)} $$All vectors are column vectors.
Note: That can equivalently written with row vectors $\vec v'$:
$$ \vec z'^{(l+1)} = \vec a'^{(l)} \cdot W'^{(l)} + \vec b'^{(l)} $$$W'$ is the transpose of $W$, i.e. $w_{ij}=w'_{ji}$ with:
The activations $\vec a^{(l)}$ for each layer (except the input) are computed from $\vec z^{(l)}$ by application of an activation function $g^{(l)}(\vec z^{(l)})$ (element-wise).
$$ \vec a^{(l+1)} = g^{(l+1)}(\vec z^{(l+1)}) = g^{(l+1)}(W^{(l)} \cdot \vec a^{(l)} + \vec b^{(l)}) $$Typically $a(.)$ is a non-linear function.
From the difference between the target value $\vec y^{(m)}$ and the net output $\vec o^{(m)}$ a cost function $E^{(m)}$ for each training example $m$ can be computed, e.g. cross-entropy error. The total cost of all training examples is given by
$$ E = \sum_{m} E^{(m)} $$The goal of the training procedure is to minimize the training error. To archive this goal the parameters (connection weights) should be modified.
We write for all parameters for a layer $\theta_{ij}^{(l)}$ (instead of $w_{ij}^{(l)}$ and the bias $b_j^{(l)}$ - see below)
For small differences $\Delta E$ this can be approximated by:
$$ \Delta E \approx \sum \frac{\partial E}{\partial \theta^{(l)}_{ij}} \Delta \theta^{(l)}_{ij} $$To train a neural network with (stocastic) gradient decent (SGD) the input $\vec x^{(m)}$ of a (random) training example $(\vec x^{(m)}, \vec y^{(m)})$ is feed in the input layer.
Then after computing all ${\partial E^{(m)}}/{\partial \theta^{(l)}_{ij}}$ the weights are modified by
$$ \theta^{(l)}_{ij} \leftarrow \theta^{(l)}_{ij} - \alpha \frac{\partial E^{(m)}}{\partial \theta^{(l)}_{ij}} $$$\alpha$ is a hyperparameter of the training procedure, called the learning rate.
The the following we drop the index $(m)$ of $E$. (The same holds for batch and mini-batch learning).
With the famous backpropagation algorithm it's possible to compute efficiently all ${\partial E}/{\partial \theta^{(l)}_{ij}}$.
For mathematical convenience $w_{ij}^{(l)}$ and the bias $b_j^{(l)}$ are written together as $\theta_{ij}^{(l)}$ with $\theta_{0j}^{(l)} = b_{j}^{(l)}$ and defining $a_0^{(l)}:=1$:
$$ z_i^{(l+1)} = \sum_{j=0}^{n} \theta_{ij}^{(l)} a_j^{(l)} $$In vector-matrix form ($\vec a^{(l)}$ has now an additional element $a_0^{(l)}:=1$): $$ \vec z^{(l+1)} = \Theta^{(l)} \vec a^{(l)} $$
If $a$ is a function of $b$ this can be written as $a = a(b)$. If $b$ is also function of another variable $c$ therefore $a$ is indirectly also a function of $c$ (transitivity).
The following relation holds for the derivatives: $$ \frac{d a(b(c))}{dc} = \frac{d a(b)}{db} \frac{db(c)}{dc} $$
Analog for partial derivatives: $$ \frac{\partial a(\vec b(\vec c))}{\partial c_i} = \sum_j \frac{\partial a}{\partial b_j} \frac{\partial b_j}{\partial c_i} $$
To compute the derivates of the costs of the $\theta$'s, note that $E$ is indirectly dependent on $\theta$ over $\vec o$
$$ E(\theta) = E(\vec o(\theta)) $$$$ \frac{\partial E}{\partial \theta^{(L-1)}_{ij}} = \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial \theta^{(L-1)}_{ij}} $$with the output activation $f(.)=g^{(L)}(.)$ $$ o_i = f(z_i^{(L)}) = f(\sum_j \theta^{(L-1)}_{ij} a_j^{(L-1)}) $$
and application of the chain rule: $$ \frac{\partial E}{\partial \theta^{(L-1)}_{ij}} = \frac{\partial E}{\partial o_i} \frac{\partial f(z_i^{(L)})}{\partial z_i^{(L)}} \frac{\partial z_i^{(L)}}{\partial \theta^{(L-1)}_{ij}} = \frac{\partial E}{\partial z_i^{(L)}} a_j^{(L-1)} $$
Using the corresponding activation function $f$ for the cost function this simplifies to (see e.g.[Bi95]):
$$ \frac{\partial E}{\partial \theta^{(L-1)}_{ij}} = (o_i - t_i) a_j^{(L-1)} $$e.g. for squared-error loss ($E^{(m)} = 1/2 (o_i-t_i)^2$) and the identity (activation) function:
$$ \frac{\partial E}{\partial z_i^{(L)}} = \frac{\partial E}{\partial o_i} \frac{\partial f(z_i^{(L)})}{\partial z_i^{(L)}} = 1/2 \frac{\partial (o_i-t_i)^2}{\partial o_i} \frac{\partial z_i^{(L)}}{\partial z_i^{(L)}} = (o_i-t_i) $$The partial derivative of $E$ to $z_i$ is called error (responsibility): $$ \delta_i^{(L)} = \frac{\partial E}{\partial z_i^{(L)}} $$
Note that the function for computing an output unit $o_i$ dependent on the corresponding weight $\theta^{(L-2)}_{jk}$ is:
$$ o_i = f(\sum_j \theta^{(L-1)}_{ij} a_j^{(L-1)}) = f(\sum_j \theta^{(L-1)}_{ij} g(z_j^{(L-1)})) = f(\sum_j \theta^{(L-1)}_{ij} g(\sum_k \theta^{(L-2)}_{jk} a_k^{(L-2)})) $$The cost (as function of $\theta^{(L-2)}_{jk}$) is in general dependent on all output units. The chain rule tells us that we must sum over all of them:
$$ \frac{\partial E}{\partial \theta^{(L-2)}_{jk}} = \sum_i \frac{\partial E}{\partial z_i^{(L)}} \frac{\partial z_i^{(L)}}{\partial \theta^{(L-2)}_{jk}} $$With the errors $\delta_i^{(L)}$ (see above): $$ \frac{\partial E}{\partial \theta^{(L-2)}_{jk}} = \sum_i \delta_i^{(L)} \frac{\partial z_i^{(L)}}{\partial \theta^{(L-2)}_{jk}} $$
Note that we can reuse the values of errors $\delta_i^{(L)}$ which we have already computed.
Therefore we just need to compute the second factor: $$ \frac{\partial z_i^{(L)}}{\partial \theta^{(L-2)}_{jk}} = \frac{\partial z_i^{(L)}}{\partial g(z_j^{(L-1)})} \frac{\partial g(z_j^{(L-1)})}{\partial \theta^{(L-2)}_{jk}} = $$ $$ \frac{\partial g(\theta_{ij}^{(L-1)} z_j^{(L-1)})} {\partial g(z_j^{(L-1)})} \frac{\partial g(z_j^{(L-1)})}{\partial z_j^{(L-1)}} \frac{\partial z_j^{(L-1)}}{\partial \theta^{(L-2)}_{jk}} = $$ $$ \theta_{ij}^{(L-1)} g'(z_j^{(L-1)}) \frac{\partial (a_k^{(L-2)} \theta^{(L-2)}_{jk})}{\partial \theta^{(L-2)}_{jk}}= $$ $$ \theta_{ij}^{(L-1)} g'(z_j^{(L-1)}) a_k^{(L-2)} $$
So we have: $$ \frac{\partial E}{\partial \theta^{(L-2)}_{jk}} = a_k^{(L-2)} g'(z_j^{(L-1)})\sum_i \delta_i^{(L)} \theta_{ij}^{(L-1)} $$
Again the error of unit $j$ is defined as $$ \delta_j^{(L-1)} = \frac{\partial E}{\partial z_j^{(L-1)}} = g'(z_j^{(L-1)}) \sum_i \delta_i^{(L)} \theta_{ij}^{(L-1)} $$
This definition holds for the errors of all hidden units. The error of an hidden layer can be computed with the errors of the layer above. This can be interpreted as the propagation of an "error signal" back through the network.
Note that we reuse the new errors $\delta_j^{(L-1)}$ again for deriving the partial derivatives for the layer $L-2$ and so on: $$ \frac{\partial E}{\partial \theta^{(L-2)}_{jk}} = \frac{\partial E}{\partial z_j^{(L-1)}} \frac{\partial z_j^{(L-1)}}{\partial \theta^{(L-2)}_{jk}} = \delta_j^{(L-1)} a_k^{(L-2)} $$
Note:
With the Hadarmard product '$\circ$' (element-wise multiplication) this can be written in matrix-vector form: $$ \vec \delta^{(l)} = g'(\vec z^{(l)}) \circ ( (\Theta^{(l)})^T \cdot \vec \delta^{(l+1)} ) $$
The error backpropagation can be exploited to compute efficiently the derivates $\partial E / \partial \theta $ of all layers.
The derivatives of the cost function can now be computed for a layer as a matrix $\frac{\partial E}{\partial \theta_{ij}^{(l)}}$:
$$ \frac{\partial E}{\partial \Theta^{(l)}} = \vec a^{(l)} \otimes \vec \delta^{(l+1)} $$'$\otimes$' represents the outer product.
So with the back-propagated errors and the forward-propagated activations the learning rule becomes: $$ \Theta^{(l)} \leftarrow \Theta^{(l)} - \alpha (\vec a^{(l)} \otimes \vec \delta^{(l+1)}) $$
np.random.uniform?
import numpy as np
def logistic_function(x):
return 1./(1 + np.exp(x))
def cross_entropy_loss(t, o):
return - (t * np.log(o) + (1 - t) * np.log(1 - o)).sum()
class FeedForwardNeuralNet(object):
def __init__(self, units_per_layer):
self.W_and_b = []
self.units_per_layer = units_per_layer
self.nb_hidden_layers = len(units_per_layer) - 2
# each layer can have a different activation function
self.activations = []
for i in range(self.nb_hidden_layers):
self.activations.append(np.tanh)
self.activations.append(logistic_function)
for i in range(len(units_per_layer)-1):
# random initialization see [Glo10]
s = np.sqrt(6./(units_per_layer[i] + units_per_layer[i+1]))
W = np.random.uniform(size=(units_per_layer[i], units_per_layer[i+1]), low=-s, high=s)
bias = np.zeros([units_per_layer[i+1]])
self.W_and_b.append((W, bias))
def forward_pass(self, x):
zs = []
zs.append(None)
aa = []
a = x
aa.append(a)
for i, (W, bias) in enumerate(self.W_and_b):
z = np.dot(a, W) + bias
zs.append(z)
a = self.activations[i](z)
aa.append(a)
return zs, aa
def forward_backward_pass(self, train_example):
x, t = train_example
# first we do a forward pass to get all zs and all activations
zs, aa = self.forward_pass(x)
# backward pass:
# the output is the activation of the last layer
o = aa[-1]
# errors of the outputs:
deltas = []
# we use the "natural pairing" between cost function and
# output activation function, so the errors are
# just the difference between output activations and the targets
delta = (t-o)
deltas.append(delta)
for i in range(self.nb_hidden_layers, 0, -1):
W, b = self.W_and_b[i]
#the errors of the layer below are given by
delta = np.dot(delta, W.T)
# note the derivative of d(tanh(x))/dx = 1 - tanh^2(x)
derivative_tanh = 1 - (np.tanh(zs[i]))**2
delta = delta * derivative_tanh
deltas.append(delta)
deltas.append(None)
deltas = list(reversed(deltas))
return deltas, aa
def get_partial_derivatives(self, train_example):
deltas, activations = self.forward_backward_pass(train_example)
dW = []
db = []
for i, (W, b) in enumerate(self.W_and_b):
dW.append(np.outer(activations[i] , deltas[i+1]))
db.append(deltas[i+1])
return dW, db
# for checking that the gradient is computed correctly
# should be done as a test in a real implementation
def get_numerical_derivatives(train_examples):
raise NotImplementedError() #TODO
# simple random online learning
def train(self, nb_epochs, train_data, learning_rate):
for i in range(nb_epochs):
for j in range(len(train_data)):
index = np.random.random_integers(0, len(train_data)-1)
dW, db = self.get_partial_derivatives(train_data[index])
dWn, dbn = self.get_partial_derivatives(train_data[index])
for i, (W, b) in enumerate(self.W_and_b):
W -= learning_rate * dW[i]
b -= learning_rate * db[i]
if __name__ == '__main__':
# 2 input neurons
# 4 hidden (only one hidden layer)
# 1 output unit
nn = FeedForwardNeuralNet([2, 4, 1])
# simple x-or example
xy_0 = np.array([0,0]), np.array([0])
xy_1 = np.array([1,0]), np.array([1])
xy_2 = np.array([0,1]), np.array([1])
xy_3 = np.array([1,1]), np.array([0])
train_data = [xy_0, xy_1, xy_2, xy_3]
nb_epochs = 1000
nn.train(nb_epochs, train_data, 1.1)
for d in train_data:
dummy, aa = nn.forward_pass(d[0])
print aa[-1] , " approx " , d[1]
# Generate training data
# Polar coordinates: r, phi
def get_x(nb, c):
assert (c==0 or c==1)
r = c + np.random.rand(nb)
phi = np.random.rand(nb) * 2 * np.pi
return np.concatenate(((r * np.sin(phi)).reshape(-1,1), (r * np.cos(phi)).reshape(-1,1)), axis=1)
def get_data(nb_1, nb_2):
x_0 = get_x(nb_1, 0)
x_1 = get_x(nb_2, 1)
X = np.concatenate((x_0, x_1), axis=0)
t = np.zeros(len(X))
t[len(x_0):] = 1
p = np.random.permutation(range(len(X)))
# permute the data
X = X[p]
t = t[p]
return X, t
X_train, t_train = get_data(30, 35)
X_test, t_test = get_data(20, 20)
%matplotlib inline
import matplotlib.pyplot as plt
def plot_train_data(x_0, x_1):
plt.figure(figsize=(7,7))
plt.scatter(x_0[:,0], x_0[:,1], label='Class 0', color='b')
plt.scatter(x_1[:,0], x_1[:,1], label='Class 1', color='r')
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
ra = np.arange(-1,1.01,0.01, dtype=np.float64)
ci = np.ndarray(len(ra))
ci[:-1] = (np.sqrt(1. - ra[:-1]**2))
ci[-1] = 0. # hack
plt.plot(ra, ci,'g-')
plt.plot(ra, -ci,'g-')
plt.xlim(-2.,2.)
plt.ylim(-2.,2.)
plot_train_data(X_train[t_train==0], X_train[t_train==1])
nn = FeedForwardNeuralNet([2, 4, 1])
nb_epochs = 1000
train_data = np.array(zip(X_train, t_train))
nn.train(nb_epochs, train_data, 1.1)
test_data = np.array(zip(X_test, t_test))
i=0
for d in test_data:
dummy, aa = nn.forward_pass(d[0])
#print bool(d[1])
if bool(aa[-1][0]>0.5) == bool(d[1]):
i +=1
#print "prediction: ", aa[-1] , " \t - true value" , d[1]
print i, " of ", len(test_data), " test data correct classified."