Feedforward Neural Networks in a Nutshell

Here you just find a short summary (math and implementation with python) of the forward and backward pass of feedforward neural networks.

If you a new to neural networks first read the first two chapters of the online book Neural Network and Deep Learning or the first chapters from a neural networks text book (see the literature list below). For a less dense derivation of backpropagation and a tutorial implementation of neural networks with python see the Hacker's guide to Neural Networks.


The activities of the layers can be written in vector form:

  • Input layer: $\vec{x} = \vec{a}^{(1)}$ with elements $x_i = a^{(1)}_{i}$
  • Hidden layers: $\vec{a}^{(l)}$ with $ 1 < l < L $
  • Output layer: $\vec{o} = \vec a^{(L)}$

with $1 \leq i \leq n^{(l)}$ with $n^{(l)}$ neurons in the layer $l$. The number of units $n^{(l)}$ can be different for each layer.

Each layer $l$ (except the output layer $l=L$) has a bias unit $a_0^{(l)} = 1$.

Forward Pass

The activities of the neurons -except for the input layer- are determined by the activity vector of the layer below.

For a full connected neural network for each unit $i$ the weighted sum of the activities of the layer below is:

$$ z_i^{(l+1)} = \sum_{j=1}^{n} w_{ij}^{(l)} a_j^{(l)} + b_{j} $$

In vector-matrix form (affine transformation):

$$ \vec z^{(l+1)} = W^{(l)} \cdot \vec a^{(l)} + \vec b^{(l)} $$

All vectors are column vectors.

Note: That can equivalently written with row vectors $\vec v'$:

$$ \vec z'^{(l+1)} = \vec a'^{(l)} \cdot W'^{(l)} + \vec b'^{(l)} $$

$W'$ is the transpose of $W$, i.e. $w_{ij}=w'_{ji}$ with:

  • $i$: Index of the next layer
  • $j$: Index of the previous layer

Activation function

The activations $\vec a^{(l)}$ for each layer (except the input) are computed from $\vec z^{(l)}$ by application of an activation function $g^{(l)}(\vec z^{(l)})$ (element-wise).

$$ \vec a^{(l+1)} = g^{(l+1)}(\vec z^{(l+1)}) = g^{(l+1)}(W^{(l)} \cdot \vec a^{(l)} + \vec b^{(l)}) $$

Typically $a(.)$ is a non-linear function.


From the difference between the target value $\vec y^{(m)}$ and the net output $\vec o^{(m)}$ a cost function $E^{(m)}$ for each training example $m$ can be computed, e.g. cross-entropy error. The total cost of all training examples is given by

$$ E = \sum_{m} E^{(m)} $$

The goal of the training procedure is to minimize the training error. To archive this goal the parameters (connection weights) should be modified.

We write for all parameters for a layer $\theta_{ij}^{(l)}$ (instead of $w_{ij}^{(l)}$ and the bias $b_j^{(l)}$ - see below)

For small differences $\Delta E$ this can be approximated by:

$$ \Delta E \approx \sum \frac{\partial E}{\partial \theta^{(l)}_{ij}} \Delta \theta^{(l)}_{ij} $$

To train a neural network with (stocastic) gradient decent (SGD) the input $\vec x^{(m)}$ of a (random) training example $(\vec x^{(m)}, \vec y^{(m)})$ is feed in the input layer.

Then after computing all ${\partial E^{(m)}}/{\partial \theta^{(l)}_{ij}}$ the weights are modified by

$$ \theta^{(l)}_{ij} \leftarrow \theta^{(l)}_{ij} - \alpha \frac{\partial E^{(m)}}{\partial \theta^{(l)}_{ij}} $$

$\alpha$ is a hyperparameter of the training procedure, called the learning rate.

The the following we drop the index $(m)$ of $E$. (The same holds for batch and mini-batch learning).

With the famous backpropagation algorithm it's possible to compute efficiently all ${\partial E}/{\partial \theta^{(l)}_{ij}}$.

Backward pass

For mathematical convenience $w_{ij}^{(l)}$ and the bias $b_j^{(l)}$ are written together as $\theta_{ij}^{(l)}$ with $\theta_{0j}^{(l)} = b_{j}^{(l)}$ and defining $a_0^{(l)}:=1$:

$$ z_i^{(l+1)} = \sum_{j=0}^{n} \theta_{ij}^{(l)} a_j^{(l)} $$
  • $i \in \{1, \dots n^{(l+1)}\}$
  • $j \in \{0, \dots n^{(l)}\}$

In vector-matrix form ($\vec a^{(l)}$ has now an additional element $a_0^{(l)}:=1$): $$ \vec z^{(l+1)} = \Theta^{(l)} \vec a^{(l)} $$

Recap: Chain rule

If $a$ is a function of $b$ this can be written as $a = a(b)$. If $b$ is also function of another variable $c$ therefore $a$ is indirectly also a function of $c$ (transitivity).

The following relation holds for the derivatives: $$ \frac{d a(b(c))}{dc} = \frac{d a(b)}{db} \frac{db(c)}{dc} $$

Analog for partial derivatives: $$ \frac{\partial a(\vec b(\vec c))}{\partial c_i} = \sum_j \frac{\partial a}{\partial b_j} \frac{\partial b_j}{\partial c_i} $$

Last Layer

To compute the derivates of the costs of the $\theta$'s, note that $E$ is indirectly dependent on $\theta$ over $\vec o$

$$ E(\theta) = E(\vec o(\theta)) $$$$ \frac{\partial E}{\partial \theta^{(L-1)}_{ij}} = \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial \theta^{(L-1)}_{ij}} $$

with the output activation $f(.)=g^{(L)}(.)$ $$ o_i = f(z_i^{(L)}) = f(\sum_j \theta^{(L-1)}_{ij} a_j^{(L-1)}) $$

and application of the chain rule: $$ \frac{\partial E}{\partial \theta^{(L-1)}_{ij}} = \frac{\partial E}{\partial o_i} \frac{\partial f(z_i^{(L)})}{\partial z_i^{(L)}} \frac{\partial z_i^{(L)}}{\partial \theta^{(L-1)}_{ij}} = \frac{\partial E}{\partial z_i^{(L)}} a_j^{(L-1)} $$

Using the corresponding activation function $f$ for the cost function this simplifies to (see e.g.[Bi95]):

$$ \frac{\partial E}{\partial \theta^{(L-1)}_{ij}} = (o_i - t_i) a_j^{(L-1)} $$

e.g. for squared-error loss ($E^{(m)} = 1/2 (o_i-t_i)^2$) and the identity (activation) function:

$$ \frac{\partial E}{\partial z_i^{(L)}} = \frac{\partial E}{\partial o_i} \frac{\partial f(z_i^{(L)})}{\partial z_i^{(L)}} = 1/2 \frac{\partial (o_i-t_i)^2}{\partial o_i} \frac{\partial z_i^{(L)}}{\partial z_i^{(L)}} = (o_i-t_i) $$

The partial derivative of $E$ to $z_i$ is called error (responsibility): $$ \delta_i^{(L)} = \frac{\partial E}{\partial z_i^{(L)}} $$

Second Last Layer

Note that the function for computing an output unit $o_i$ dependent on the corresponding weight $\theta^{(L-2)}_{jk}$ is:

$$ o_i = f(\sum_j \theta^{(L-1)}_{ij} a_j^{(L-1)}) = f(\sum_j \theta^{(L-1)}_{ij} g(z_j^{(L-1)})) = f(\sum_j \theta^{(L-1)}_{ij} g(\sum_k \theta^{(L-2)}_{jk} a_k^{(L-2)})) $$

The cost (as function of $\theta^{(L-2)}_{jk}$) is in general dependent on all output units. The chain rule tells us that we must sum over all of them:

$$ \frac{\partial E}{\partial \theta^{(L-2)}_{jk}} = \sum_i \frac{\partial E}{\partial z_i^{(L)}} \frac{\partial z_i^{(L)}}{\partial \theta^{(L-2)}_{jk}} $$

With the errors $\delta_i^{(L)}$ (see above): $$ \frac{\partial E}{\partial \theta^{(L-2)}_{jk}} = \sum_i \delta_i^{(L)} \frac{\partial z_i^{(L)}}{\partial \theta^{(L-2)}_{jk}} $$

Note that we can reuse the values of errors $\delta_i^{(L)}$ which we have already computed.

Therefore we just need to compute the second factor: $$ \frac{\partial z_i^{(L)}}{\partial \theta^{(L-2)}_{jk}} = \frac{\partial z_i^{(L)}}{\partial g(z_j^{(L-1)})} \frac{\partial g(z_j^{(L-1)})}{\partial \theta^{(L-2)}_{jk}} = $$ $$ \frac{\partial g(\theta_{ij}^{(L-1)} z_j^{(L-1)})} {\partial g(z_j^{(L-1)})} \frac{\partial g(z_j^{(L-1)})}{\partial z_j^{(L-1)}} \frac{\partial z_j^{(L-1)}}{\partial \theta^{(L-2)}_{jk}} = $$ $$ \theta_{ij}^{(L-1)} g'(z_j^{(L-1)}) \frac{\partial (a_k^{(L-2)} \theta^{(L-2)}_{jk})}{\partial \theta^{(L-2)}_{jk}}= $$ $$ \theta_{ij}^{(L-1)} g'(z_j^{(L-1)}) a_k^{(L-2)} $$

So we have: $$ \frac{\partial E}{\partial \theta^{(L-2)}_{jk}} = a_k^{(L-2)} g'(z_j^{(L-1)})\sum_i \delta_i^{(L)} \theta_{ij}^{(L-1)} $$

Again the error of unit $j$ is defined as $$ \delta_j^{(L-1)} = \frac{\partial E}{\partial z_j^{(L-1)}} = g'(z_j^{(L-1)}) \sum_i \delta_i^{(L)} \theta_{ij}^{(L-1)} $$

This definition holds for the errors of all hidden units. The error of an hidden layer can be computed with the errors of the layer above. This can be interpreted as the propagation of an "error signal" back through the network.

Note that we reuse the new errors $\delta_j^{(L-1)}$ again for deriving the partial derivatives for the layer $L-2$ and so on: $$ \frac{\partial E}{\partial \theta^{(L-2)}_{jk}} = \frac{\partial E}{\partial z_j^{(L-1)}} \frac{\partial z_j^{(L-1)}}{\partial \theta^{(L-2)}_{jk}} = \delta_j^{(L-1)} a_k^{(L-2)} $$


  • The error backpropagation is 'linear' in the $\delta$'s. No squashing by an activation function is done in contrast to the forward pass. So the error is attempted to explode or to vanish (much more than the forward signal). That's the so called vanishing gradient problem [Hoch91].
    • Intuition: If you have an squared weight matrix and you are multiplying it on a vector again and again (c.f. simple recurrent neural networks) the vector explodes or vanishes depending on the size of the largest eigenvalue of the matrix. This argument also holds despite of the additional factor $g'(z_j^{(L-1)})$ (for a more formal derivation see [Pas13]).
  • To compute the errors we multiply with $g'(z_j^{(L-1)})$. So if we use an activation function which saturates (sigmoid, tanh) the error will become very small due to this multiplication.

Vector-Matrix formulation

With the Hadarmard product '$\circ$' (element-wise multiplication) this can be written in matrix-vector form: $$ \vec \delta^{(l)} = g'(\vec z^{(l)}) \circ ( (\Theta^{(l)})^T \cdot \vec \delta^{(l+1)} ) $$

The error backpropagation can be exploited to compute efficiently the derivates $\partial E / \partial \theta $ of all layers.

The derivatives of the cost function can now be computed for a layer as a matrix $\frac{\partial E}{\partial \theta_{ij}^{(l)}}$:

$$ \frac{\partial E}{\partial \Theta^{(l)}} = \vec a^{(l)} \otimes \vec \delta^{(l+1)} $$

'$\otimes$' represents the outer product.

So with the back-propagated errors and the forward-propagated activations the learning rule becomes: $$ \Theta^{(l)} \leftarrow \Theta^{(l)} - \alpha (\vec a^{(l)} \otimes \vec \delta^{(l+1)}) $$

Demonstration implementation of a feedforward neural network

  • Classification with possible target values $t \in \{0, 1\}$
In [82]:
In [104]:
import numpy as np

def logistic_function(x):
    return 1./(1 + np.exp(x))

def cross_entropy_loss(t, o):
    return - (t * np.log(o) + (1 - t) * np.log(1 - o)).sum()
class FeedForwardNeuralNet(object):
    def __init__(self, units_per_layer):
        self.W_and_b = []
        self.units_per_layer = units_per_layer
        self.nb_hidden_layers = len(units_per_layer) - 2     
        # each layer can have a different activation function     
        self.activations = []    
        for i in range(self.nb_hidden_layers):
        for i in range(len(units_per_layer)-1):
        	# random initialization see [Glo10]
            s = np.sqrt(6./(units_per_layer[i] + units_per_layer[i+1]))
            W = np.random.uniform(size=(units_per_layer[i], units_per_layer[i+1]), low=-s, high=s) 
            bias = np.zeros([units_per_layer[i+1]]) 
            self.W_and_b.append((W, bias))
    def forward_pass(self, x):
        zs = []
        aa = []
        a = x
        for i, (W, bias) in enumerate(self.W_and_b):  
            z = np.dot(a, W) + bias 
            a = self.activations[i](z)
        return zs, aa          
    def forward_backward_pass(self, train_example):
        x, t = train_example
        # first we do a forward pass to get all zs and all activations   
        zs, aa = self.forward_pass(x)
        # backward pass:
        # the output is the activation of the last layer
        o = aa[-1]
        # errors of the outputs:
        deltas = []
        # we use the "natural pairing" between cost function and 
        # output activation function, so the errors are 
        # just the difference between output activations and the targets
        delta = (t-o)
        for i in range(self.nb_hidden_layers, 0, -1):
            W, b = self.W_and_b[i]
            #the errors of the layer below are given by
            delta =  np.dot(delta, W.T)
            # note the derivative of d(tanh(x))/dx = 1 - tanh^2(x)
            derivative_tanh = 1 - (np.tanh(zs[i]))**2
            delta = delta * derivative_tanh
        deltas = list(reversed(deltas))
        return deltas, aa
    def get_partial_derivatives(self, train_example):
        deltas, activations = self.forward_backward_pass(train_example)
        dW = []
        db = []
        for i, (W, b) in enumerate(self.W_and_b):
            dW.append(np.outer(activations[i] , deltas[i+1]))
        return dW, db 
    # for checking that the gradient is computed correctly
    # should be done as a test in a real implementation
    def get_numerical_derivatives(train_examples):
        raise NotImplementedError() #TODO
    # simple random online learning     
    def train(self, nb_epochs, train_data, learning_rate):
        for i in range(nb_epochs):
            for j in range(len(train_data)):
                index = np.random.random_integers(0, len(train_data)-1)
                dW, db = self.get_partial_derivatives(train_data[index])
                dWn, dbn = self.get_partial_derivatives(train_data[index])
                for i, (W, b) in enumerate(self.W_and_b):
                    W -= learning_rate * dW[i]
                    b -= learning_rate * db[i]

if __name__ == '__main__':
    # 2 input neurons
    # 4 hidden (only one hidden layer)
    # 1 output unit
    nn = FeedForwardNeuralNet([2, 4, 1])
    # simple x-or example
    xy_0 =  np.array([0,0]), np.array([0])
    xy_1 =  np.array([1,0]), np.array([1])
    xy_2 =  np.array([0,1]), np.array([1])
    xy_3 =  np.array([1,1]), np.array([0])
    train_data = [xy_0, xy_1, xy_2, xy_3]

    nb_epochs = 1000
    nn.train(nb_epochs, train_data, 1.1)

    for d in train_data:
       dummy, aa = nn.forward_pass(d[0])
       print aa[-1] ,  " approx " , d[1]
[ 0.00056098]  approx  [0]
[ 0.99927238]  approx  [1]
[ 0.99930197]  approx  [1]
[ 0.00035623]  approx  [0]

Test on an other simple data set:

In [105]:
# Generate training data 
# Polar coordinates: r, phi
def get_x(nb, c):
    assert (c==0 or c==1)
    r = c + np.random.rand(nb)
    phi = np.random.rand(nb) * 2 * np.pi
    return np.concatenate(((r * np.sin(phi)).reshape(-1,1), (r * np.cos(phi)).reshape(-1,1)), axis=1)
In [106]:
def get_data(nb_1, nb_2):
    x_0 = get_x(nb_1, 0)
    x_1 = get_x(nb_2, 1)
    X = np.concatenate((x_0, x_1), axis=0)
    t = np.zeros(len(X))
    t[len(x_0):] = 1
    p = np.random.permutation(range(len(X)))
    # permute the data
    X = X[p]
    t = t[p]
    return X, t

X_train, t_train = get_data(30, 35)
X_test, t_test = get_data(20, 20)
In [107]:
%matplotlib inline
import matplotlib.pyplot as plt
def plot_train_data(x_0, x_1):
    plt.scatter(x_0[:,0], x_0[:,1], label='Class 0', color='b') 
    plt.scatter(x_1[:,0], x_1[:,1], label='Class 1', color='r') 
    ra = np.arange(-1,1.01,0.01, dtype=np.float64)
    ci = np.ndarray(len(ra))
    ci[:-1] = (np.sqrt(1. - ra[:-1]**2))
    ci[-1] = 0. # hack 
    plt.plot(ra, ci,'g-')
    plt.plot(ra, -ci,'g-')
In [108]:
plot_train_data(X_train[t_train==0], X_train[t_train==1])
In [111]:
nn = FeedForwardNeuralNet([2, 4, 1])
nb_epochs = 1000
train_data = np.array(zip(X_train, t_train))
nn.train(nb_epochs, train_data, 1.1)
In [112]:
test_data = np.array(zip(X_test, t_test))
for d in test_data:
    dummy, aa = nn.forward_pass(d[0])
    #print bool(d[1])
    if bool(aa[-1][0]>0.5) == bool(d[1]):
        i +=1
    #print "prediction: ", aa[-1] ,  " \t  -  true value" , d[1]
print i, " of ", len(test_data), " test data correct classified."    
36  of  40  test data correct classified.


General Neural Network Literature

  • [Bi95] C. Bishop: Neural Networks for Pattern Recognition, Oxford University Press 1995
  • S. Haykin: Neural Networks and Learning Machines (3rd Edition), Pearson, 2008
Vanishing gradient problem
  • [Hoch91] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991. Advisor: J. Schmidhuber.
  • [Pas13] Razvan Pascanu, Tomas Mikolov, Yoshua Bengio: On the difficulty of training Recurrent Neural Networks, ICML (3) 2013: 1310-1318

Weight initialization