Recurrent Neural Networks

Here we implement a simple recurrent neural network (an Elman network [Elm]) with Theano.

Characteristics of recurrent neural networks:

  • high-dimensional hidden state (distributed representation)
  • non-linear dynamics
  • RNNs with standard linear connections are computationally as powerful as any Turing machine [SiSo91]

Forward Pass

  • Sequence of input vectors: $\vec{x}_1, \dots , \vec{x}_T$
  • Hidden vectors: $\vec{h}_1, \dots , \vec{h}_T$
  • Output Vectors: $\vec{o}_1, \dots , \vec{o}_T$

The parameters of the recurrent neural network:

  • Input-hidden connection matrix: $W_{ih}$
  • Biases of the hidden neurons: $\vec{b}_h$
  • Hidden-hidden connection matrix: $W_{hh}$
  • Biases of the output neurons: $\vec{b}_o$
  • Hidden-output connection matrix: $W_{ho}$

The state of the hidden neurons at a time step $t$ is computed by: $$ \vec{h}_t = act_h (\vec{x}_{t} W_{ih} + \vec{h}_{t-1} W_{hh} + \vec{b}_h) $$

The state of the outputs is given by: $$ \vec{o}_t = act_o (\vec{h}_t W_{ho} + \vec{b}_o) $$


Remark: In the formulas above the vectors are row vectors. This differs slightly from the usual formulations, where the vectors are column vectors. The reason is that in nearly all Theano code examples a matrix-vector product is written as T.dot(input, self.W). Note that the following relation holds:

$$ (\vec x\, W_{xy})^T = W_{xy}^T \vec{x}^T = W_{yx} \vec{x}^T = W_{yx} \vec{z} $$ with $W_{yx} := W_{xy}^T$; $\vec z$ is a column vector and $\vec x$ the corresponding row vector ($\vec z = \vec x^T$).
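
To make the row-vector convention concrete, here is a minimal NumPy sketch of the forward pass (illustrative only; the shapes and the random initialization are assumptions, not the initialization used below):

import numpy as np

n_in, n_hid, n_out, n_steps = 7, 10, 7, 5
W_ih = 0.1 * np.random.randn(n_in, n_hid)   # input-hidden
W_hh = 0.1 * np.random.randn(n_hid, n_hid)  # hidden-hidden
W_ho = 0.1 * np.random.randn(n_hid, n_out)  # hidden-output
b_h = np.zeros(n_hid)
b_o = np.zeros(n_out)

h = np.zeros(n_hid)                  # hidden state h_0
xs = np.random.randn(n_steps, n_in)  # one input row vector per time step
for x in xs:
    h = np.tanh(np.dot(x, W_ih) + np.dot(h, W_hh) + b_h)   # h_t
    o = 1. / (1. + np.exp(-(np.dot(h, W_ho) + b_o)))        # o_t (logistic act_o)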

Unrolling in time

Figure: RNN unrolled in time.

The hidden state at $t=0$ can also be learned and can be considered a parameter of the RNN.

Backward Pass

Unrolling the recurrent neural network in time allows us to apply gradient descent learning similar to feed-forward neural networks.

Here we construct an expression graph with Theano, which takes care of computing the gradients and updating the weights.
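
As a minimal illustration of what Theano does for us (a toy sketch, not part of the RNN code below): T.grad returns the symbolic gradient of an expression, which can then be compiled into a function and used in an update rule.

import theano
import theano.tensor as T

w = T.dscalar('w')
loss = (w - 3.) ** 2        # a toy scalar loss
g = T.grad(loss, w)         # symbolic derivative: 2*(w - 3)
grad_fn = theano.function([w], g)
print grad_fn(5.)           # 4.0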

Implementation with Theano

In [1]:
import numpy as np
import theano
import theano.tensor as T

dtype=theano.config.floatX

We use the Reber grammar data (see reber grammar), so the dimensions of the layers are:

In [2]:
import reberGrammar
n_in  = 7 
n_hid = 10
n_out = 7

To keep things simple we have just one example vector $\vec v$ for each time step $t$, so all inputs can be written as a matrix:

In [3]:
#first dimension is time
v = T.matrix(dtype=dtype) 

Because vanishing and exploding gradients depend on the largest singular value of the weight matrices (see [Pas13]), we
initialize the weights by fixing the largest singular value to 1 (appropriate for tanh activations), similar to the Groundhog implementation:

In [4]:
def rescale_weights(values, factor=1.):
    factor = np.cast[dtype](factor)
    _,svs,_ = np.linalg.svd(values)
    #svs[0] is the largest singular value                      
    values = values / svs[0]
    return values

def sample_weights(sizeX, sizeY):
    values = np.ndarray([sizeX, sizeY], dtype=dtype)
    for dx in xrange(sizeX):
        vals = np.random.uniform(low=-1., high=1.,  size=(sizeY,))
        #vals_norm = np.sqrt((vals**2).sum())
        #vals = vals / vals_norm
        values[dx,:] = vals
    _,svs,_ = np.linalg.svd(values)
    #svs[0] is the largest singular value                      
    values = values / svs[0]
    return values
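
A quick check (illustrative, with an assumed 10x10 shape) that the sampled matrix indeed has largest singular value 1:

W = sample_weights(10, 10)
print np.linalg.svd(W)[1][0]   # largest singular value, approximately 1.0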

Parameters

The parameters of the RNN are defined as Theano shared variables:

In [5]:
# parameters of the rnn as shared variables
def get_parameter(n_in, n_out, n_hid):
    b_h = theano.shared(np.zeros(n_hid, dtype=dtype)) 
    h0 = theano.shared(np.zeros(n_hid, dtype=dtype))
    W_ih = theano.shared(sample_weights(n_in, n_hid))
    W_hh = theano.shared(sample_weights(n_hid, n_hid))
    W_ho = theano.shared(sample_weights(n_hid, n_out))
    b_o = theano.shared(np.zeros(n_out, dtype=dtype)) 
    return W_ih, W_hh, b_h, W_ho, b_o, h0

W_ih, W_hh, b_h, W_ho, b_o, h0 = get_parameter(n_in, n_out, n_hid)   
params = [W_ih, W_hh, b_h, W_ho, b_o, h0]

The basis of the implementation with Theano is the scan operator.

Scan takes a function which is computed at each iteration step.

In [6]:
# we could use the fact that maximally two outputs are on, 
# but for simplicity we assume independent outputs:
def logistic_function(x):
    return 1./(1 + T.exp(-x))

# sequences: x_t
# prior results: h_tm1
# non-sequences: W_ih, W_hh, b_h, W_ho, b_o
def one_step(x_t, h_tm1, W_ih, W_hh, b_h, W_ho, b_o):
    h_t = T.tanh(theano.dot(x_t, W_ih) + theano.dot(h_tm1, W_hh) + b_h)
    y_t = theano.dot(h_t, W_ho) + b_o 
    y_t = logistic_function(y_t) 
    return [h_t, y_t]

Scan iterates through the sequences and computes a function at each iteration step. The four parameters (that we use) for scan are:

  • fn: The function which is computed at each time step
  • sequences: The sequence input to the function fn
  • outputs_info: Initialization of the iteration. It must correspond to the return type of fn, so we provide h0 for the hidden state and None for the output (which needs no initialization).
  • non_sequences: Variables that are used as additional input at each timestep.

Note that the parameters of the function fn (here one_step) are: sequences, prior results, and non-sequences, in exactly this order.
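
As a standalone illustration of this calling convention (a toy example, independent of the RNN), here scan accumulates a running sum over a sequence:

up_to = T.iscalar()
seq = T.arange(up_to)

# sequences: a_t; prior result: sum_tm1 -- in exactly this order
def accumulate(a_t, sum_tm1):
    return sum_tm1 + a_t

sums, _ = theano.scan(fn=accumulate,
                      sequences=seq,
                      outputs_info=T.as_tensor_variable(np.asarray(0, seq.dtype)))
running_sum = theano.function(inputs=[up_to], outputs=sums)
print running_sum(5)   # [ 0  1  3  6 10]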

In [7]:
# hidden and outputs of the entire sequence
[h_vals, o_vals], _ = theano.scan(fn=one_step, 
                                  sequences = dict(input=v, taps=[0]), 
                                  outputs_info = [h0, None], # corresponds to return type of fn
                                  non_sequences = [W_ih, W_hh, b_h, W_ho, b_o] )

The target values are also given as a matrix:

In [8]:
# target values
target = T.matrix(dtype=dtype)

We train the net with stochastic gradient descent, with the learning rate as a hyperparameter of the training procedure. Theano takes care of backpropagation through time.
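
Each parameter $\theta$ is updated by a step in the direction of the negative gradient, $$ \theta \leftarrow \theta - \eta \, \frac{\partial\, cost}{\partial \theta} $$ where $\eta$ is the learning rate.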

In [9]:
# learning rate
lr = np.cast[dtype](0.2)
learning_rate = theano.shared(lr)

We do classification, therefore the cost function is the mean of the cross-entropy losses (under the independent-outputs simplification) over all time steps.
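
Written out, with target $y_{t,k}$ and predicted output $o_{t,k}$ for output unit $k$ at time step $t$, the cost is $$ cost = - \frac{1}{T \cdot n_{out}} \sum_{t=1}^{T} \sum_{k=1}^{n_{out}} \left[ y_{t,k} \log o_{t,k} + (1 - y_{t,k}) \log (1 - o_{t,k}) \right] $$ (the mean is taken over all entries of the $T \times n_{out}$ output matrix).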

In [10]:
cost = -T.mean(target * T.log(o_vals) + (1.- target) * T.log(1. - o_vals))
In [11]:
def get_train_functions(cost, v, target):
    gparams = []
    for param in params:
        gparam = T.grad(cost, param)
        gparams.append(gparam)

    updates=[]
    for param, gparam in zip(params, gparams):
        updates.append((param, param - gparam * learning_rate))
    learn_rnn_fn = theano.function(inputs = [v, target],
                                   outputs = cost,
                                   updates = updates)
    return learn_rnn_fn
In [12]:
learn_rnn_fn = get_train_functions(cost, v, target)
/usr/local/lib/python2.7/dist-packages/theano/tensor/subtensor.py:114: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  stop in [None, length, maxsize] or
/usr/local/lib/python2.7/dist-packages/theano/tensor/opt.py:2165: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if (replace_x == replace_y and
/usr/local/lib/python2.7/dist-packages/theano/tensor/subtensor.py:110: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  start in [None, 0] or
/usr/local/lib/python2.7/dist-packages/theano/scan_module/scan_perform_ext.py:85: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  from scan_perform.scan_perform import *

Let's generate some training data. Here we use the reberGrammar module (you must copy the code into a file reberGrammar.py for the import to work!):

In [13]:
import reberGrammar
train_data = reberGrammar.get_n_examples(1000)
In [14]:
def train_rnn(train_data, nb_epochs=50):      
  train_errors = np.ndarray(nb_epochs)
  for x in range(nb_epochs):
    error = 0.
    for j in range(len(train_data)):  
        index = np.random.randint(0, len(train_data))
        i, o = train_data[index]
        train_cost = learn_rnn_fn(i, o)
        error += train_cost
    train_errors[x] = error
  return train_errors

nb_epochs=10
train_errors = train_rnn(train_data, nb_epochs)
In [15]:
%matplotlib inline
import matplotlib.pyplot as plt
def plot_learning_curve(train_errors):
    plt.plot(np.arange(nb_epochs), train_errors, 'b-')
    plt.xlabel('epochs')
    plt.ylabel('error')
plot_learning_curve(train_errors)

Predictions

For the predictions we need to compile a new function.

In [16]:
predictions = theano.function(inputs = [v], outputs = o_vals)
In [17]:
inp, outp = reberGrammar.get_one_example(10)
pre = predictions(inp)
for p, o in zip(pre, outp):
    print p # prediction
    print o # target
    print
[ 0.00456713  0.99421883  0.00498617  0.00396112  0.99583906  0.00794847
  0.00271153]
[ 0.  1.  0.  0.  1.  0.  0.]

[  3.89091292e-04   9.92518902e-01   8.59818608e-03   8.50704685e-03
   5.94907766e-03   9.97637510e-01   1.13319975e-05]
[ 0.  1.  0.  0.  0.  1.  0.]

[  2.48272001e-04   9.96478260e-01   2.50943843e-03   2.77127395e-03
   1.83269428e-03   9.99300718e-01   1.23795908e-05]
[ 0.  1.  0.  0.  0.  1.  0.]

[  3.56875709e-04   5.44762611e-03   1.35493829e-05   1.37530033e-05
   9.91696715e-01   9.97204721e-01   4.37155832e-03]
[ 0.  0.  0.  0.  1.  1.  0.]

[  1.23524433e-03   2.56308727e-03   9.94648755e-01   9.94855106e-01
   3.11062729e-04   2.18423526e-03   3.83205223e-03]
[ 0.  0.  1.  1.  0.  0.  0.]

[  4.90578823e-04   9.98072445e-01   5.00442926e-04   5.66002680e-04
   1.31634716e-03   9.93887603e-01   1.19577569e-03]
[ 0.  1.  0.  0.  0.  1.  0.]

[  2.51081918e-04   9.97105837e-01   2.00509862e-03   2.11595721e-03
   2.79020611e-03   9.99550700e-01   9.12862015e-06]
[ 0.  1.  0.  0.  0.  1.  0.]

[  3.57442623e-04   5.33420872e-03   1.34936245e-05   1.37383076e-05
   9.91536856e-01   9.97191250e-01   4.47221892e-03]
[ 0.  0.  0.  0.  1.  1.  0.]

[  1.24084740e-03   2.54000421e-03   9.94790792e-01   9.94985580e-01
   3.12648015e-04   2.14149500e-03   3.80805437e-03]
[ 0.  0.  1.  1.  0.  0.  0.]

[  4.92787454e-04   9.98076558e-01   4.92200896e-04   5.56850282e-04
   1.31844566e-03   9.93862867e-01   1.22822251e-03]
[ 0.  1.  0.  0.  0.  1.  0.]

[  2.51108489e-04   9.97099459e-01   2.01760489e-03   2.12938944e-03
   2.78066448e-03   9.99547243e-01   9.13820986e-06]
[ 0.  1.  0.  0.  0.  1.  0.]

[  2.37004439e-04   9.97305036e-01   1.85600575e-03   1.99846854e-03
   2.59547005e-03   9.99582410e-01   8.67310246e-06]
[ 0.  1.  0.  0.  0.  1.  0.]

[  3.59748461e-04   5.50844055e-03   1.31341203e-05   1.33567864e-05
   9.91393149e-01   9.97298002e-01   4.52233106e-03]
[ 0.  0.  0.  0.  1.  1.  0.]

[  1.23822107e-03   2.55193049e-03   9.94808733e-01   9.95000362e-01
   3.15402023e-04   2.16433848e-03   3.73414974e-03]
[ 0.  0.  1.  1.  0.  0.  0.]

[  4.94612672e-04   9.98081505e-01   4.78766422e-04   5.41684916e-04
   1.33576209e-03   9.93895411e-01   1.25921855e-03]
[ 0.  1.  0.  0.  0.  1.  0.]

[  2.51182704e-04   9.97085392e-01   2.03460408e-03   2.14810786e-03
   2.76179821e-03   9.99541998e-01   9.17930538e-06]
[ 0.  1.  0.  0.  0.  1.  0.]

[  2.37012340e-04   9.97305930e-01   1.85609411e-03   1.99846085e-03
   2.59599206e-03   9.99582469e-01   8.67048038e-06]
[ 0.  1.  0.  0.  0.  1.  0.]

[  3.59748461e-04   5.50843822e-03   1.31341831e-05   1.33568756e-05
   9.91392851e-01   9.97298181e-01   4.52219183e-03]
[ 0.  0.  0.  0.  1.  1.  0.]

[  1.69217924e-03   1.45280137e-04   1.28758512e-02   1.32959578e-02
   1.39275165e-02   6.47724420e-03   9.83603418e-01]
[ 0.  0.  0.  0.  0.  0.  1.]


Sampling

In many cases it is interesting to generate sequences from recurrent neural networks. Let's sample from the trained RNN to produce Reber grammar strings.

In [18]:
from theano.tensor.shared_randomstreams import RandomStreams
srng = RandomStreams(seed=234)

def get_sample(probs):
    return srng.multinomial(n=1, pvals=probs)    
In [19]:
# sequences: 
# prior results: h_tm1, y_tm1
# non-sequences: W_ih, W_hh, b_h, W_ho, b_o
def sampling_step(h_tm1, y_tm1, W_ih, W_hh, b_h, W_ho, b_o):
    h_t = T.tanh(theano.dot(y_tm1, W_ih) + theano.dot(h_tm1, W_hh) + b_h)
    y_t = theano.dot(h_t, W_ho) + b_o
    y_t = logistic_function(y_t)
    # normalize probabilities to all outputs
    y_t = y_t / T.sum(y_t)
    y_t = get_sample(y_t)
    y_t = T.cast(y_t, dtype)
    return [h_t, y_t], theano.scan_module.until(y_t[6]>.99)
In [20]:
# always start with B-Symbol
#y0 = T.constant(np.array([1,0,0,0,0,0,0], dtype=dtype))
y0 = theano.shared(np.array([1.,0.,0.,0.,0.,0.,0.], dtype=dtype))

# hidden and outputs of the entire sequence
[h_vals, y_vals], updates = theano.scan(fn=sampling_step, 
                                  sequences = [], 
                                  n_steps = 10000,
                                  outputs_info = [h0, y0], # corresponds to return type of fn
                                  non_sequences = [W_ih, W_hh, b_h, W_ho, b_o] )
In [35]:
samples = theano.function(inputs = [], outputs = y_vals, updates=updates)
# TODO get rid of this warning message!?
In [22]:
for i in range(20):
    sample_sequence = np.concatenate((np.array([[1.,0.,0.,0.,0.,0.,0.]], dtype='float32'), samples()), axis=0)
    word = reberGrammar.sequenceToWord(sample_sequence)
    print "inGrammar(",word,") = ", reberGrammar.in_grammar(word)
inGrammar( BTXXVVE ) =  True
inGrammar( BTSSXSE ) =  True
inGrammar( BSXXXTTVPSE ) =  False
inGrammar( BTSSXSXXTVPXTTVPXTVVE ) =  False
inGrammar( BPVPXVPSE ) =  True
inGrammar( BTSSXXTTTVVE ) =  True
inGrammar( BPVPXVVE ) =  True
inGrammar( BPTVPSE ) =  True
inGrammar( BTXSE ) =  True
inGrammar( BPTVVE ) =  True
inGrammar( BTSSSSXSE ) =  True
inGrammar( BPVPXTTTTTTTTVPSE ) =  True
inGrammar( BTSSSSSSXXTTVPSE ) =  True
inGrammar( BTXSXXTTVVE ) =  False
inGrammar( BPTVPSE ) =  True
inGrammar( BTXSE ) =  True
inGrammar( BTSXSE ) =  True
inGrammar( BTSXSE ) =  True
inGrammar( BTSSSXXTTVVE ) =  True
inGrammar( BTSXXTTTTVVE ) =  True

As you can see, most of the generated strings are in the grammar. With a little bit of tuning it should be possible to produce mostly correct Reber strings.
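
A quick way to quantify this (a small sketch reusing the samples function and the reberGrammar helpers from above) is to estimate the fraction of sampled strings that are in the grammar:

n_samples = 100
n_in_grammar = 0
for i in range(n_samples):
    seq = np.concatenate((np.array([[1.,0.,0.,0.,0.,0.,0.]], dtype=dtype), samples()), axis=0)
    word = reberGrammar.sequenceToWord(seq)
    n_in_grammar += reberGrammar.in_grammar(word)
print "fraction in grammar:", n_in_grammar / float(n_samples)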

Long-range dependencies

In contrast to the Reber grammar, the embedded Reber grammar data set has a long-range dependency: the second symbol is the same as the second-to-last symbol.

In [23]:
i, o = reberGrammar.get_one_embedded_example()
print i[1]
print o[-2]
[ 0.  1.  0.  0.  0.  0.  0.]
[ 0.  1.  0.  0.  0.  0.  0.]

This long-range dependency cannot be learned effectively by an Elman net trained with backpropagation through time, because of the vanishing gradient problem [H91]:

In [24]:
W_ih, W_hh, b_h, W_ho, b_o, h0 = get_parameter(n_in, n_out, n_hid)   
params = [W_ih, W_hh, b_h, W_ho, b_o, h0]

# hidden and outputs of the entire sequence
[h_vals, o_vals], _ = theano.scan(fn=one_step, 
                                  sequences = dict(input=v, taps=[0]), 
                                  outputs_info = [h0, None], # corresponds to return type of fn
                                  non_sequences = [W_ih, W_hh, b_h, W_ho, b_o] )

cost = -T.mean(target * T.log(o_vals)+ (1.- target) * T.log(1. - o_vals))

learn_rnn_fn = get_train_functions(cost, v, target)


train_data = reberGrammar.get_n_embedded_examples(1000)
train_errors = train_rnn(train_data)
In [25]:
predictions = theano.function(inputs = [v], outputs = o_vals)

i, o = reberGrammar.get_one_embedded_example()
print i[1]
print o[-2]
print predictions(i)[-2]
[ 0.  1.  0.  0.  0.  0.  0.]
[ 0.  1.  0.  0.  0.  0.  0.]
[  1.01969663e-05   5.41333497e-01   9.26787849e-04   9.03664622e-04
   4.43905503e-01   1.12090493e-05   4.53989524e-05]

Note that the second and fifth output neurons, which code for the two embedding symbols, both have high probability (about 50%), nearly independent of the correct output.

A solution for this problem is the LSTM recurrent neural network.

Exercise: Scan Operator

The data (inputs and targets) are now 3D tensors, with the sequence (example) index as the first dimension and time as the second:

In [26]:
x = np.array(np.arange(0., 2 * np.pi, 0.2), dtype=dtype)
x = x.reshape(len(x), 1)
s_ = np.concatenate([np.sin(x), np.cos(x)], axis=1)

sh_ = np.roll(s_, 1, axis = 0)
S_ = np.array([s_,sh_])
T_ = np.roll(S_, 1 , axis=1)

The input is S_ and the output T_:

In [27]:
V = T.tensor3(dtype=dtype)
Ta = T.tensor3(dtype=dtype)

For educational purposes you should nest scan in this exercise to get more familiar with Theano's scan: an outer scan over the examples and an inner scan over the time steps.

Solution

In [28]:
n_in  = 2
n_hid = 4
n_out = 2

W_ih, W_hh, b_h, W_ho, b_o, h0 = get_parameter(n_in, n_out, n_hid)   
params = [W_ih, W_hh, b_h, W_ho, b_o, h0]

# learning rate
lr = np.cast[dtype](.1)
learning_rate = theano.shared(lr)
In [29]:
# sequences: x_t
# prior results: h_tm1
# non-sequences: W_ih, W_hh, b_h, W_ho, b_o
def one_step(x_t, h_tm1, W_ih, W_hh, b_h, W_ho, b_o):
    h_t = T.tanh(theano.dot(x_t, W_ih) + theano.dot(h_tm1, W_hh) + b_h)
    y_t = theano.dot(h_t, W_ho) + b_o 
    # no squashing for the output - it's a regression task
    return [h_t, y_t]

Scan iterates over the leading dimension of the sequences.

In [30]:
def loop_over_examples(x):                             
  # hidden and outputs of the entire sequence
  [h_vals, o_vals], _ = theano.scan(fn=one_step,
                sequences = dict(input = x, taps=[0]),
                outputs_info = [h0, None], # corresponds to return type of one_step
                non_sequences = [W_ih, W_hh, b_h, W_ho, b_o]
                )
  return o_vals
#  return y_vals



O_vals, _ = theano.scan(fn = loop_over_examples,
                        sequences = dict(input = V, taps=[0]),
                        outputs_info = None
                        )

f = theano.function(inputs=[V], outputs=O_vals)
In [31]:
# It's a regression task, so the cost is the mean of the squared error losses: 
cost = ((O_vals - Ta)**2).mean()


gparams = []
for param in params:
  gparam = T.grad(cost, param)
  gparams.append(gparam)

updates=[]
for param, gparam in zip(params, gparams):
    updates.append((param, param - gparam * learning_rate))



learn_rnn_fn = theano.function(inputs = [V, Ta],
                               outputs = cost,
                               updates = updates)
In [32]:
for i in xrange(10000):
    train_cost = learn_rnn_fn(S_,T_)
train_cost
Out[32]:
array(0.00013053907605353743, dtype=float32)
In [33]:
predictions = theano.function(inputs = [V], outputs = O_vals)
In [34]:
P = predictions(S_)
P - T_
Out[34]:
array([[[ -7.19032288e-02,  -1.51873231e-02],
        [  1.49764298e-02,   5.75470924e-03],
        [ -1.65076554e-03,   1.21617913e-02],
        [ -4.97138500e-03,   6.02239370e-03],
        [ -3.87209654e-03,   3.20142508e-03],
        [ -3.86077166e-03,   3.84402275e-03],
        [ -3.69441509e-03,   2.22593546e-03],
        [ -5.04136086e-04,  -3.21844220e-03],
        [  4.14699316e-03,  -7.26194680e-03],
        [  6.27458096e-03,  -4.86717187e-03],
        [  4.32229042e-03,   2.32505798e-03],
        [ -1.08122826e-04,   7.85854459e-03],
        [ -4.36645746e-03,   6.85900450e-03],
        [ -6.15882874e-03,   2.14874744e-04],
        [ -4.15265560e-03,  -6.32989407e-03],
        [  9.48488712e-04,  -7.26580620e-03],
        [  5.63132763e-03,  -2.03198195e-03],
        [  5.54761291e-03,   4.99790907e-03],
        [  2.00062990e-04,   8.23211670e-03],
        [ -4.99019027e-03,   4.79072332e-03],
        [ -4.36818600e-03,  -2.94905901e-03],
        [  5.58316708e-04,  -8.26257467e-03],
        [  3.40670347e-03,  -6.05040789e-03],
        [  1.43980980e-03,   1.38863921e-03],
        [ -2.03424692e-03,   6.51907176e-03],
        [ -2.81870365e-03,   4.85010445e-03],
        [ -2.66373158e-04,  -8.90910625e-04],
        [  3.29697132e-03,  -4.68042493e-03],
        [  5.01155853e-03,  -3.27384472e-03],
        [  2.86394358e-03,   1.32173300e-03],
        [ -3.54745984e-03,   4.16660309e-03],
        [ -1.21185482e-02,   2.05302238e-03]],

       [[  7.47956336e-02,   1.48375034e-02],
        [ -3.86624187e-02,  -6.11352921e-03],
        [  2.96305548e-02,  -1.21138096e-02],
        [  8.16561282e-03,  -1.06803179e-02],
        [  1.22517347e-03,  -5.92654943e-03],
        [ -1.11764669e-03,  -6.71744347e-05],
        [ -3.37606668e-03,   4.17947769e-03],
        [ -3.91870737e-03,   3.07285786e-03],
        [ -6.66975975e-04,  -2.74935365e-03],
        [  4.12893295e-03,  -7.10009038e-03],
        [  6.30688667e-03,  -4.82780300e-03],
        [  4.35113907e-03,   2.32927501e-03],
        [ -9.25660133e-05,   7.85452127e-03],
        [ -4.36019897e-03,   6.85405731e-03],
        [ -6.15710020e-03,   2.11536884e-04],
        [ -4.15247679e-03,  -6.33144379e-03],
        [  9.48339701e-04,  -7.26610422e-03],
        [  5.63119352e-03,  -2.03180313e-03],
        [  5.54755330e-03,   4.99802828e-03],
        [  2.00003386e-04,   8.23223591e-03],
        [ -4.99022007e-03,   4.79078293e-03],
        [ -4.36818600e-03,  -2.94905901e-03],
        [  5.58316708e-04,  -8.26257467e-03],
        [  3.40670347e-03,  -6.05040789e-03],
        [  1.43980980e-03,   1.38863921e-03],
        [ -2.03424692e-03,   6.51907176e-03],
        [ -2.81870365e-03,   4.85010445e-03],
        [ -2.66373158e-04,  -8.90910625e-04],
        [  3.29697132e-03,  -4.68042493e-03],
        [  5.01155853e-03,  -3.27384472e-03],
        [  2.86394358e-03,   1.32173300e-03],
        [ -3.54745984e-03,   4.16660309e-03]]], dtype=float32)