LSTM (Long Short-Term Memory)

Learning long-range dependencies with backpropagation through time is extremely difficult because of the vanishing gradient problem [H91]. With an Elman net we failed to handle the long-range dependency of the embedded Reber grammar.

LSTM was introduced by [SH97] to mitigate the vanishing gradient problem [H91]. Later, forget gates and peephole connections were added to the original LSTM algorithm for further improvement [GSC00] [FG01].

In an LSTM network there are three different gates (input, output, and forget gates) that control the memory cells and their visibility:

The memory cell state is determined by the current input $\vec x(t)$ together with the previous output $\vec h(t-1)$, gated by the input gate, and by the previous cell state $\vec c(t-1)$, gated by the forget gate. The memory block output $\vec h(t)$ is given by the transformed current cell state $\vec c(t)$ gated by the output gate. With peephole connections the memory cell state also influences the states of the gates.

A set of gates can control more than one cell unit. Such a composition of three gates with one or more cell units is called a memory block.

For full details of LSTM see [FG01].

The following graph shows such an LSTM memory block.

[Figure: LSTM memory block with input, forget and output gates; the peephole connections are drawn in blue.]

The blue arrows are the peephole connections; through them the gates "see" the cell state(s) even if the output gate is closed.

Implementation with Theano

In the following, each memory block has only one memory cell, so all cell (and gate) states of the complete hidden layer can be written as vectors, e.g. the cell states as $\vec c_t$.

The forward pass formulas for LSTM are then ($t$ is now a subscript index, as usual):

Input gates: $$ \vec i_t = \sigma( \vec x_t W_{xi} + \vec h_{t-1} W_{hi} + \vec c_{t-1} W_{ci} + \vec b_i) $$

Forget gates: $$ \vec f_t = \sigma (\vec x_t W_{xf} + \vec h_{t-1} W_{hf} + \vec c_{t-1} W_{cf} + \vec b_f) $$

Cell units: $$ \vec c_t = \vec f_t \circ \vec c_{t-1} + \vec i_t \circ \tanh(\vec x_t W_{xc} +\vec h_{t-1} W_{hc} + \vec b_c) $$

Output gates: $$ \vec o_t = \sigma(\vec x_t W_{xo}+ \vec h_{t-1} W_{ho} + \vec c_t W_{co} + \vec b_o) $$

The hidden activation (output of the cell) is also given by a product of two terms:

$$ \vec h_t = \vec o_t \circ \tanh (\vec c_t) $$

'$\circ$' is the Hadamard product (element-wise multiplication).


Note that the formulas for the cell units and for the hidden activation contain products between (neuron) states. Therefore, LSTM networks are second-order recurrent neural networks.
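To make the shapes concrete, here is a minimal NumPy sketch of a single forward step for a layer with n_in inputs and n_hidden memory cells. The names mirror the formulas above; this is only an illustration, the actual Theano implementation follows below.

import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

n_in, n_hidden = 7, 10
rng = np.random.RandomState(0)

def rand(shape):
    return rng.uniform(-0.1, 0.1, size=shape)

# input-to-hidden weights have shape (n_in, n_hidden),
# recurrent and peephole weights have shape (n_hidden, n_hidden)
W_xi, W_hi, W_ci = rand((n_in, n_hidden)), rand((n_hidden, n_hidden)), rand((n_hidden, n_hidden))
W_xf, W_hf, W_cf = rand((n_in, n_hidden)), rand((n_hidden, n_hidden)), rand((n_hidden, n_hidden))
W_xc, W_hc       = rand((n_in, n_hidden)), rand((n_hidden, n_hidden))
W_xo, W_ho, W_co = rand((n_in, n_hidden)), rand((n_hidden, n_hidden)), rand((n_hidden, n_hidden))
b_i = b_f = b_c = b_o = np.zeros(n_hidden)

x_t   = rng.uniform(size=n_in)   # current input,       shape (n_in,)
h_tm1 = np.zeros(n_hidden)       # previous output,     shape (n_hidden,)
c_tm1 = np.zeros(n_hidden)       # previous cell state, shape (n_hidden,)

i_t = sigmoid(x_t.dot(W_xi) + h_tm1.dot(W_hi) + c_tm1.dot(W_ci) + b_i)
f_t = sigmoid(x_t.dot(W_xf) + h_tm1.dot(W_hf) + c_tm1.dot(W_cf) + b_f)
c_t = f_t * c_tm1 + i_t * np.tanh(x_t.dot(W_xc) + h_tm1.dot(W_hc) + b_c)
o_t = sigmoid(x_t.dot(W_xo) + h_tm1.dot(W_ho) + c_t.dot(W_co) + b_o)
h_t = o_t * np.tanh(c_t)         # both h_t and c_t have shape (n_hidden,)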

In [17]:
import numpy as np
import theano
import theano.tensor as T

dtype=theano.config.floatX
In [18]:
# squashing of the gates should result in values between 0 and 1
# therefore we use the logistic function
sigma = lambda x: 1 / (1 + T.exp(-x))


# for the other activation function we use the tanh
act = T.tanh

# sequences: x_t
# prior results: h_tm1, c_tm1
# non-sequences: W_xi, W_hi, W_ci, b_i, W_xf, W_hf, W_cf, b_f, W_xc, W_hc, b_c, W_xo, W_ho, W_co, b_o, W_hy, b_y
def one_lstm_step(x_t, h_tm1, c_tm1, W_xi, W_hi, W_ci, b_i, W_xf, W_hf, W_cf, b_f, W_xc, W_hc, b_c, W_xo, W_ho, W_co, b_o, W_hy, b_y):
    i_t = sigma(theano.dot(x_t, W_xi) + theano.dot(h_tm1, W_hi) + theano.dot(c_tm1, W_ci) + b_i)
    f_t = sigma(theano.dot(x_t, W_xf) + theano.dot(h_tm1, W_hf) + theano.dot(c_tm1, W_cf) + b_f)
    c_t = f_t * c_tm1 + i_t * act(theano.dot(x_t, W_xc) + theano.dot(h_tm1, W_hc) + b_c) 
    o_t = sigma(theano.dot(x_t, W_xo)+ theano.dot(h_tm1, W_ho) + theano.dot(c_t, W_co)  + b_o)
    h_t = o_t * act(c_t)
    y_t = sigma(theano.dot(h_t, W_hy) + b_y) 
    return [h_t, c_t, y_t]

For simplicity we use the same weight initialization scheme as for the simple recurrent neural networks.

In [19]:
#TODO: Use a more appropriate initialization method
def sample_weights(sizeX, sizeY):
    values = np.ndarray([sizeX, sizeY], dtype=dtype)
    for dx in xrange(sizeX):
        vals = np.random.uniform(low=-1., high=1.,  size=(sizeY,))
        #vals_norm = np.sqrt((vals**2).sum())
        #vals = vals / vals_norm
        values[dx,:] = vals
    _,svs,_ = np.linalg.svd(values)
    #svs[0] is the largest singular value                      
    values = values / svs[0]
    return values  
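One option for the TODO above would be an orthogonal initialization, which is a common choice for recurrent weight matrices. This is only a sketch of an alternative, not part of the original code:

def orthogonal_weights(sizeX, sizeY):
    # the QR decomposition of a random Gaussian matrix gives an orthonormal basis;
    # we slice it down to the requested shape
    a = np.random.normal(0., 1., (max(sizeX, sizeY), max(sizeX, sizeY)))
    q, _ = np.linalg.qr(a)
    return q[:sizeX, :sizeY].astype(dtype)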
In [20]:
n_in = 7 # for embedded reber grammar
n_hidden = n_i = n_c = n_o = n_f = 10
n_y = 7 # for embedded reber grammar

# initialize weights
# i_t and o_t should be "open" or "closed"
# f_t should be "open" (don't forget at the beginning of training)
# we try to achieve this by appropriate initialization of the corresponding biases 

W_xi = theano.shared(sample_weights(n_in, n_i))  
W_hi = theano.shared(sample_weights(n_hidden, n_i))  
W_ci = theano.shared(sample_weights(n_c, n_i))  
b_i = theano.shared(np.cast[dtype](np.random.uniform(-0.5,.5,size = n_i)))
W_xf = theano.shared(sample_weights(n_in, n_f)) 
W_hf = theano.shared(sample_weights(n_hidden, n_f))
W_cf = theano.shared(sample_weights(n_c, n_f))
b_f = theano.shared(np.cast[dtype](np.random.uniform(0, 1.,size = n_f)))
W_xc = theano.shared(sample_weights(n_in, n_c))  
W_hc = theano.shared(sample_weights(n_hidden, n_c))
b_c = theano.shared(np.zeros(n_c, dtype=dtype))
W_xo = theano.shared(sample_weights(n_in, n_o))
W_ho = theano.shared(sample_weights(n_hidden, n_o))
W_co = theano.shared(sample_weights(n_c, n_o))
b_o = theano.shared(np.cast[dtype](np.random.uniform(-0.5,.5,size = n_o)))
W_hy = theano.shared(sample_weights(n_hidden, n_y))
b_y = theano.shared(np.zeros(n_y, dtype=dtype))

c0 = theano.shared(np.zeros(n_hidden, dtype=dtype))
h0 = T.tanh(c0)

params = [W_xi, W_hi, W_ci, b_i, W_xf, W_hf, W_cf, b_f, W_xc, W_hc, b_c, W_xo, W_ho, W_co, b_o, W_hy, b_y, c0]
In [21]:
#first dimension is time

#input 
v = T.matrix(dtype=dtype)

# target
target = T.matrix(dtype=dtype)
In [22]:
# hidden and outputs of the entire sequence
[h_vals, _, y_vals], _ = theano.scan(fn=one_lstm_step, 
                                  sequences = dict(input=v, taps=[0]), 
                                  outputs_info = [h0, c0, None ], # corresponds to return type of fn
                                  non_sequences = [W_xi, W_hi, W_ci, b_i, W_xf, W_hf, W_cf, b_f, W_xc, W_hc, b_c, W_xo, W_ho, W_co, b_o, W_hy, b_y] )

We use the embedded Reber grammar data set. Each output unit predicts whether the corresponding symbol is a valid next transition, so the cost function is the mean binary cross-entropy:

In [23]:
cost = -T.mean(target * T.log(y_vals)+ (1.- target) * T.log(1. - y_vals))
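This is the element-wise binary cross-entropy, averaged over all time steps and output units. Equivalently, Theano's built-in helper could be used:

cost = T.nnet.binary_crossentropy(y_vals, target).mean()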
In [24]:
# learning rate
lr = np.cast[dtype](.1)
learning_rate = theano.shared(lr)
In [25]:
gparams = []
for param in params:
  gparam = T.grad(cost, param)
  gparams.append(gparam)

updates=[]
for param, gparam in zip(params, gparams):
    updates.append((param, param - gparam * learning_rate))

If not already done, copy the Reber grammar code into a file reberGrammar.py and put it on your Python path.

In [26]:
import reberGrammar
train_data = reberGrammar.get_n_embedded_examples(1000)
In [27]:
learn_rnn_fn = theano.function(inputs = [v, target],
                               outputs = cost,
                               updates = updates)
In [28]:
nb_epochs=250
train_errors = np.ndarray(nb_epochs)
def train_rnn(train_data):      
  for x in range(nb_epochs):
    error = 0.
    for j in range(len(train_data)):  
        index = np.random.randint(0, len(train_data))
        i, o = train_data[index]
        train_cost = learn_rnn_fn(i, o)
        error += train_cost
    train_errors[x] = error 
    
train_rnn(train_data)

Let's plot the error over the epochs:

In [29]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(np.arange(nb_epochs), train_errors, 'b-')
plt.xlabel('epochs')
plt.ylabel('error')
plt.ylim(0., 50)
Out[29]:
(0.0, 50)

Usually you can see a saturation of the error before it goes down again, here after about 70 epochs. The simple recurrent net only reached an error of about 12 on the embedded Reber grammar data set, so we can assume that the drop after roughly 70 epochs occurs because the net has learnt the long-range dependency.

TODO: There are peaks where the error goes up; try clipping the gradient as a remedy, see [Pas13].
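One way to implement such clipping (a sketch, assuming the spikes come from occasionally exploding gradients as described in [Pas13]; the threshold is a placeholder and needs tuning) is to rescale all gradients whenever their joint norm exceeds the threshold:

clip_threshold = 1.0  # hypothetical value

grad_norm = T.sqrt(sum((g ** 2).sum() for g in gparams))
scale = T.minimum(1., clip_threshold / (grad_norm + 1e-7))

clipped_updates = [(param, param - learning_rate * gparam * scale)
                   for param, gparam in zip(params, gparams)]

The training function would then have to be recompiled with clipped_updates instead of updates.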

Prediction

We need a new Theano function for prediction. With it we can check whether the prediction at the second-to-last position is correct; this position carries the long-range dependency.

In [30]:
predictions = theano.function(inputs = [v], outputs = y_vals)

test_data = reberGrammar.get_n_embedded_examples(10)

def print_out(test_data):
    for i,o in test_data:
        p = predictions(i)
        print o[-2] # target
        print p[-2] # prediction
        print 
print_out(test_data)
[ 0.  0.  0.  0.  1.  0.  0.]
[  7.26549433e-06   2.44575203e-03   5.85505290e-07   4.59309916e-07
   9.95132625e-01   3.76047287e-03   3.42162419e-03]

[ 0.  0.  0.  0.  1.  0.  0.]
[  7.30976581e-06   3.41998553e-03   1.59438639e-06   1.16924286e-06
   9.94966328e-01   1.29682245e-03   3.50598316e-03]

[ 0.  0.  0.  0.  1.  0.  0.]
[  3.07976029e-06   7.21112592e-03   3.46183424e-06   2.73233741e-06
   9.96156871e-01   2.94243160e-04   3.57970735e-03]

[ 0.  1.  0.  0.  0.  0.  0.]
[  1.17116624e-05   9.98752773e-01   3.77782708e-05   3.19207174e-05
   3.70856514e-03   3.65254232e-06   4.80283936e-03]

[ 0.  1.  0.  0.  0.  0.  0.]
[  1.11635291e-05   9.98695910e-01   4.36177470e-05   3.74880074e-05
   3.15373507e-03   5.92770402e-06   5.19890618e-03]

[ 0.  1.  0.  0.  0.  0.  0.]
[  1.11324853e-05   9.99102592e-01   4.57182359e-05   3.90066998e-05
   2.82593514e-03   3.93977234e-06   4.78830189e-03]

[ 0.  1.  0.  0.  0.  0.  0.]
[  1.01509959e-05   9.98987198e-01   4.59799594e-05   3.90715904e-05
   3.92509345e-03   3.64733501e-06   4.43072896e-03]

[ 0.  1.  0.  0.  0.  0.  0.]
[  1.20873447e-05   9.98566985e-01   4.15420727e-05   3.53414507e-05
   3.07346112e-03   3.69753980e-06   5.77724120e-03]

[ 0.  0.  0.  0.  1.  0.  0.]
[  6.53217785e-06   1.82499678e-03   1.89035802e-06   1.41418468e-06
   9.95668828e-01   1.42531423e-03   4.86484962e-03]

[ 0.  0.  0.  0.  1.  0.  0.]
[  3.69720124e-06   3.50134331e-03   2.42282613e-06   1.92909306e-06
   9.96156991e-01   6.05587207e-04   4.49846964e-03]


If you compare the predictions with the target values, you can see that the long-range dependency was learnt correctly.
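Instead of checking the numbers by hand, one can also count how often the most probable symbol at the second-to-last position matches the target over a larger test set. A minimal sketch, assuming the targets at that position are one-hot as in the output above:

test_data = reberGrammar.get_n_embedded_examples(500)

n_correct = 0
for i, o in test_data:
    p = predictions(i)
    # the long-range dependency is handled correctly if the most probable
    # symbol at the second-to-last position matches the target
    if np.argmax(p[-2]) == np.argmax(o[-2]):
        n_correct += 1

print(float(n_correct) / len(test_data))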