Introduction to Restricted Boltzmann Machines

with a tutorial implementation of RBMs in Python (NumPy).

Note: The purpose of this implementation is to demonstrate the basic theory. It focuses on the principles and therefore does not scale to real-world examples.

Random variables

A Boltzmann machine consists of a set of binary random variables $s_i \in \{0, 1\}$, which come in two types (also called units):

  • "observed" variables and
  • "hidden" variables.

The number of visible variables is $v$ and the number of hidden variables $w$.

The state vector of the observed variables is $\vec x$ and the state vector of the hidden variables $\vec y$.

For the visible units we have training data, e.g. two vectors $\vec x_1^T = (1,1,0)$ and $\vec x_2^T = (0,1,1)$. So each vector $\vec x_m$ corresponds to a state of the visible units.

In [1]:
import numpy as np

# Training data in matrix form
# each row is an example
X = np.array([[1,1,0],[0,1,1]])

The Restriction: No direct connections between units of the same layer

Each state of a Restricted Boltzmann machine (RBM) has an energy which is given by

$$ E_{\theta}(\vec x, \vec y) = - \vec{b}^T \cdot \vec x - \vec {c}^T \cdot \vec y - \vec {y}^T \hat W \vec x $$

with the parameters $\theta = \{ \vec b, \vec c, \hat W\}$.

  • $\vec b$ is a column vector with $v$ elements. So $\vec b^T$ is a row vector.
  • $\vec c$ is a column vector with $w$ elements.
  • $\hat W$ is matrix with $w$ rows and $v$ columns.

With numpy it's possible to compute the energy for many states in matrix notation. The combination of the $k$-th rows of $X$ and $Y$ corresponds to one state.

In [2]:
def energy_function(data, param):
  X, Y = data
  (W, b, c) = param
  # rows of X and Y are paired states; returns one energy per row
  return -(np.dot(X, W.T) * Y).sum(axis=1) - (b * X).sum(axis=1) - (c * Y).sum(axis=1)
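
A quick sanity check (our own hedged example, with hypothetical parameter values): with zero weights, zero hidden biases and all visible biases set to one, the energy of each training example reduces to $-\vec b^T \vec x$.

W0 = np.zeros((2, 3))
b0 = np.ones(3)
c0 = np.zeros(2)
Y0 = np.array([[1, 0], [0, 1]])  # an arbitrary hidden state for each training row
print(energy_function((X, Y0), (W0, b0, c0)))  # -> [-2. -2.], since both rows of X sum to 2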

Undirected Graphical Model

An RBM is an undirected graphical model. In the vanilla form, each visible unit is connected with each hidden unit.

Here we plot it with daft:

In [3]:
from matplotlib import rc
rc("font", family="serif", size=16)
rc("text", usetex=True)
import daft
def plot_RBM():
    pgm = daft.PGM([6.3, 4.05], origin=[-1., -0.3], aspect=1.)
    pgm.add_node(daft.Node("x1", r"$x_1$", 1.5, 1, observed=True))
    pgm.add_node(daft.Node("x2", r"$x_2$", 2.5, 1, observed=True))
    pgm.add_node(daft.Node("x3", r"$x_3$", 3.5, 1, observed=True))
    
    pgm.add_node(daft.Node("y1", r"$y_1$", 2., 2.2))
    pgm.add_node(daft.Node("y2", r"$y_2$", 3, 2.2))
    
    # Add in the edges.

    pgm.add_edge("x1", "y1", directed=False)
    pgm.add_edge("x2", "y1", directed=False)
    pgm.add_edge("x3", "y1", directed=False)
    pgm.add_edge("x1", "y2", directed=False)
    pgm.add_edge("x2", "y2", directed=False)
    pgm.add_edge("x3", "y2", directed=False)

    # misuse of plates for plotting a box.
    #pgm.add_plate(daft.Plate([1.5, 1.7, 2., .9], label=r"hidden", shift=-0.1))
    #pgm.add_plate(daft.Plate([1., .5, 3., 1.], label=r"visible", shift=-0.1))
    pgm.render()
In [15]:
plot_RBM()

Boltzmann factor

The probability of a state $(\vec x, \vec y)$ (the occupation probability in physics) depends on the energy $E(\vec x, \vec y)$ and is proportional to the Boltzmann factor

$$ p_\theta(\vec x, \vec y) \propto \exp(-\beta E(\vec x, \vec y)) $$
  • with $\beta = 1/(kT)$: $k$ is the Boltzmann constant and $T$ the temperature of the system.

Without loss of generality (by changing the parameters $\theta$) we set $\beta=1$.

In the following we write $p(..)$ instead of $p_\theta(..)$.

In [5]:
def boltzmann_factor(energy, beta=1.):
  beta = np.array(beta)
  energy = np.array(energy)
  return np.exp(- beta * energy)

To compute normalized probabilities we also need a normalization factor: $$ Z \equiv \sum_{\vec x',\vec y'} \exp(-E(\vec x', \vec y')) $$ where the sum runs over all possible states of the system. $Z$ (German: Zustandssumme) is called the partition function.

Note that to compute the partition function we need to sum over all possible states. This number grows exponentially with the number of units.

So the normalized probability is:

$$ p(\vec x, \vec y) = \frac{ \exp(-\beta E(\vec x, \vec y))}{Z} $$

For a first direct naive tutorial implementation of a restricted Boltzmann machine we need some Python functions to generate all possible states for the computation of the partition function.

But be careful: this is only feasible for very small numbers of visible and hidden units! The number of summands (possible states) in $Z$ is $2^{v+w}$.

In [6]:
import itertools

X_DIM = X.shape[1] # The training data determine this number
Y_DIM = 2 # Let's have two hidden units

def get_combinations(dim):
  return np.array(list(itertools.product([0, 1], repeat=dim)))

def all_data_combinations(x_dim, y_dim):
  X_all = get_combinations(x_dim)
  Y_all = get_combinations(y_dim)
  XY_all = list(itertools.product(X_all, Y_all))
  return XY_all

def relations_as_matrices(XY_all):
  X_all = np.ndarray([len(XY_all), len(XY_all[0][0])])
  Y_all = np.ndarray([len(XY_all), len(XY_all[0][1])])
  for i, (x,y) in enumerate(XY_all):
    X_all[i,:] = x 
    Y_all[i,:] = y
  return X_all, Y_all

def cartesian_product_as_matrices(X, Y):
  XY_all = list(itertools.product(X, Y))  
  return relations_as_matrices(XY_all)
  
def all_data_combination_as_matrices(x_dim, y_dim):
  XY = all_data_combinations(x_dim, y_dim)
  return relations_as_matrices(XY)
  
# the combination of the k-th row of X_all and Y_all is one of all the possible states
XY_ALL = X_all, Y_all = all_data_combination_as_matrices(X_DIM, Y_DIM)
assert len(Y_all) == len(X_all)
assert 2**(X_DIM + Y_DIM) == len(X_all)

# all possible states of the hidden units
Y_COM = get_combinations(Y_DIM)
assert 2**(Y_DIM) == len(Y_COM)

# all possible states of the visible units
X_COM = get_combinations(X_DIM)
assert 2**(X_DIM) == len(X_COM)
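
For illustration, get_combinations enumerates all binary states of a given number of units:

print(get_combinations(2))
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]]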

So we can compute (naively) the partition function $Z$:

In [7]:
def partition_function(XY_ALL, param):
  X_all, Y_all = XY_ALL
  return boltzmann_factor(energy_function((X_all, Y_all), param)).sum()
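
A quick plausibility check (with hypothetical all-zero parameters): every state then has energy $0$, every Boltzmann factor is $1$, and $Z$ is just the number of states, $2^{v+w}$.

param0 = (np.zeros((Y_DIM, X_DIM)), np.zeros(X_DIM), np.zeros(Y_DIM))
assert partition_function(XY_ALL, param0) == 2**(X_DIM + Y_DIM)  # 32 states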

Marginalization

For computing the marginalized probability $p(\vec x)$ we must sum over all hidden states $\vec y'$:

$$ p(\vec x) = \sum_{\vec y'} p( \vec x, \vec y') = \sum_{\vec y'} \frac{ \exp(-E(\vec x, \vec y'))}{Z} $$
In [8]:
def partial_trace(X, param, Y_COM=Y_COM):
  """
  This is the inefficient version: no use of the
  factorization property (see below) - should only be used in a test.
  """
  XY_c = cartesian_product_as_matrices(X, Y_COM)
  bf = boltzmann_factor(energy_function(XY_c, param))
  # itertools.product lists all y-states for one x-state consecutively,
  # so reshaping lets us sum over all y's belonging to the same x
  return bf.reshape(len(X), len(Y_COM)).sum(axis=1)

def p_v_(X, param, Z, Y_COM=Y_COM):
  """
  Probability of the visible states in matrix X.
  One state per row. 
  Inefficient - just for testing! 
  """
  return partial_trace(X, param, Y_COM=Y_COM) / Z
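
As a consistency check (again with the hypothetical zero parameters param0 from the check above), the probabilities of all possible visible states must sum to one:

Z0 = partition_function(XY_ALL, param0)
assert np.isclose(p_v_(X_COM, param0, Z0, Y_COM=Y_COM).sum(), 1.)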

Free Energy

We can reformulate $p(\vec x)$ to get the same functional as for $p(\vec x, \vec y)$:

$$ p(\vec x) = \frac{1}{Z}\sum_{\vec y'} \exp(-E(\vec x, \vec y')) = \frac{1}{Z}\exp \left( \log \sum_{\vec y'} \exp(-E(\vec x, \vec y')) \right) $$

with the definition of the free energy for an $\vec x$-state $F(\vec x)$,

$$ F(\vec x) \equiv -\log\left( \sum_{\vec y'} \exp(-E(\vec x, \vec y'))\right) $$

we get what we wanted: $$ p(\vec x) = \frac{ \exp(-F(\vec x))}{Z} $$

The normalizer $Z$ can be reformulated:

$$ Z = \sum_{\vec x} \exp(-F(\vec x)) $$

Factorization Property

The probability $p(\vec x)$ can be computed more efficiently if we use the special form of the energy:

$$ p(\vec x) = \frac{1}{Z} \exp(-F(\vec x)) = \frac{1}{Z} \sum_{\vec y} \exp(-E(\vec x, \vec y) ) = \frac{1}{Z} \sum_{\vec y} \exp( \vec{b}^T \cdot \vec x + \vec {c}^T \cdot \vec y + \vec {y}^T \hat W \vec x ) $$

$$ = \frac{1}{Z} \sum_{y_1} \sum_{y_2} \dots \sum_{y_w} \exp(\vec{b}^T \cdot \vec x + \vec {c}^T \cdot \vec y + \vec {y}^T \hat W \vec x ) $$

$$ = \frac{\exp(\vec{b}^T \cdot \vec x)}{Z}\sum_{y_1} \sum_{y_2} \dots \sum_{y_w} \prod_i \exp( c_i y_i + y_i \vec W_i \vec x ) $$

$$ = \frac{\exp(\vec{b}^T \cdot \vec x)}{Z} \sum_{y_1} \exp(c_1 y_1 + y_1 \vec W_1 \vec x ) \sum_{y_2} \exp( c_2 y_2 + y_2 \vec W_2 \vec x ) \dots \sum_{y_w} \exp(c_w y_w + y_w \vec W_w \vec x ) $$

$$ = \frac{\exp( \vec{b}^T \cdot \vec x)}{Z} \prod_i \sum_{y_i} \exp(c_i y_i + y_i \vec W_i \vec x ) $$

So we don't need to sum over all possible hidden states ($\sum_{\vec y}$), which is intractable. We have a product of factors, and for each factor we just need to sum over the states of one hidden neuron $y_i$.

Note the factorization property: The probability is a product of factors.

We have binary units, i.e. each neuron has two states $y_i \in \{0, 1\}$, and with $\exp(0)=1$ the probability can be expressed as:

$$ p(\vec x) = \frac{\exp( \vec{b}^T \cdot \vec x)}{Z} \prod_i \left(1 + \exp( c_i + \vec W_i \vec x ) \right) $$

Correspondingly, the free energy $F(\vec x)$ can be computed by:

$$ F(\vec x) = - \vec{b}^T \cdot \vec x - \sum_i \log \left(1 + \exp( c_i + \vec W_i \vec x ) \right) $$

For convenience we call the factors of the product (respectively the summands of the log-sum) $\zeta_i$, so we can write a Python function for the computation of the $\zeta$'s.

$$ \zeta_i = 1 + \exp( c_i + \vec W_i \vec x ) $$

Conditional Independence

Due to the factorization property the conditional factorizes, too:

$$ p(\vec y \mid \vec x) = \frac{p(\vec y, \vec x)}{p(\vec x)} = \frac{\exp( \vec{b}^T \cdot \vec x + \vec {c}^T \cdot \vec y + \vec {y}^T \hat W \vec x)} {\sum_{{\vec y}'}\exp( \vec{b}^T \cdot \vec x + \vec {c}^T \cdot \vec y' + \vec {y}'^T \hat W \vec x)} $$

$$ = \frac{\prod_i \exp( c_i y_i + y_i \vec W_i \vec x)} {\prod_i \sum_{y'_i}\exp( c_i y'_i + y'_i \vec W_i \vec x)} = \prod_i \frac{ \exp( c_i y_i + y_i \vec W_i \vec x)} {\sum_{y'_i}\exp( c_i y'_i + y'_i \vec W_i \vec x)} = \prod_i p(y_i \mid \vec x) $$

Note: For the conditional it's not necessary to compute the partition function $Z$, so this computation is tractable.

So the hidden units of an RBM are conditionally independent of each other given the visible units, and vice versa:

$$ p(\vec y \mid \vec x) = \prod_i p(y_i \mid \vec x) $$

$$ p(\vec x \mid \vec y) = \prod_j p(x_j \mid \vec y) $$

In graphical-model terminology: the hidden units are separated from each other when the visible units are observed. This can be read off the graph.

Conditional probabilities of being active

The probability that a binary hidden unit $y_i$ is active is:

$$ p(y_i=1 \mid \vec x) = \frac{\exp(c_i + \vec W_i \vec x )}{1 + \exp(c_i + \vec W_i \vec x )} = \sigma\left(c_i + \vec W_i \vec x \right) $$

with the logistic function $\sigma$.

Analogously for $p(x_j=1 \mid \vec y)$.

Note that this is the usual equation for a neuron's output given its input.
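
As a minimal sketch (the helper name sigmoid_ is ours; the notebook defines an equivalent sigmoid below), the conditional activation probabilities of all hidden units for a batch of visible states:

def sigmoid_(z):
  return 1. / (1. + np.exp(-z))

def p_y1_given_x(X, W, c):
  # one row per visible state, one column per hidden unit: p(y_i = 1 | x)
  return sigmoid_(c + np.dot(X, W.T))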

In [9]:
def b_x(X, b):
  #return (b * X).sum(axis=1)
  return np.dot(X, b)

def zeta(X, W, c):
  # zeta_i for each hidden unit i (rows) and each example in X (columns)
  return 1. + np.exp((np.dot(W, X.T).T + c).T)

def zeta_product(X, W, c):
  z = zeta(X, W, c) 
  r = z.prod(axis=0)  
  return r

def zeta_log_sum(X, W, c):
  z = zeta(X, W, c)  
  r = (np.log(z)).sum(axis=0)  
  return r


def p_v(X, param, Z):
  """
  more efficient computation of the probability p_v using the factorization property
  """
  (W, b, c) = param
  g = zeta_product(X, W, c)
  be = b_x(X, b)
  return  np.exp(be) * g / Z   
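
For later use, a small helper (our addition, not one of the original cells) that combines b_x and zeta_log_sum into the factorized free energy:

def free_energy(X, param):
  # F(x) = -b.x - sum_i log(1 + exp(c_i + W_i x)); one value per row of X
  (W, b, c) = param
  return - b_x(X, b) - zeta_log_sum(X, W, c)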

Negative Log-Probability

For the training of an RBM we need a training objective (cost function).

Here we want to maximize the probability that the training data $\mathcal{D}$ is generated by sampling from the RBM in thermodynamic equilibrium. Under the iid assumption (independent and identically distributed training data) we want to maximize the product of the probabilities of the training examples $p(\vec x_m)$, which is equivalent to minimizing the negative log-likelihood:

$$ \mathcal{L}_{\mathcal{D}}(\theta) = - \log \left( p(\mathcal D|\theta) \right) = - \log \left( \prod_m p(\vec x_m) \right) = - \sum_m \log \left( p(\vec x_m) \right) $$

We already know how to compute $p(\vec x_m)$ (see above).

With $p(\vec x_m) = \exp(-F(\vec x_m))/Z$ and the factorized form of the free energy, this becomes explicitly:

$$ \mathcal{L}_{\mathcal{D}}(\theta) = \sum_m F(\vec x_m) + M \log Z = - \sum_m \left( \vec{b}^T \cdot \vec x_m + \sum_i \log \zeta_i(\vec x_m) \right) + M \log Z $$

with $M$ the number of training examples:

In [10]:
def negative_log_likelihood_(X, XY_ALL=XY_ALL, Y_COM=Y_COM):
  """
  inefficient computation of the negative log likelihood - should only be used in a test
  """
  n = len(X)
  return lambda param: - ((np.log(partial_trace(X, param, Y_COM=Y_COM))).sum()
                          - n * np.log(partition_function(XY_ALL, param)))



def negative_log_likelihood(X, XY_ALL=XY_ALL, Y_COM=Y_COM):
  """
  more efficient computation of the negative log likelihood by using the factorization property
  """
  n = len(X)
  def nll(param):
    (W, b, c) = param
    return (-(b_x(X, b) + zeta_log_sum(X, W, c)).sum()
            + n * np.log(partition_function(XY_ALL, (W, b, c))))
  return nll

Gradient descent

We want to use gradient descent to minimize the cost function; therefore we need formulas for the gradients:

$$ \frac{\partial \log\left(p(\vec x)\right)}{\partial \theta} = \frac{\partial \log \left(\frac{ \exp(-F(\vec x))}{Z} \right) }{\partial \theta} = - \frac{\partial F(\vec x)}{\partial \theta} - \frac{\partial \log(Z)}{\partial \theta} = - \frac{\partial F(\vec x)}{\partial \theta} - \frac{1}{Z}\frac{\partial Z}{\partial \theta} $$

and with $$ Z = \sum_{\vec x'} \exp(-F(\vec x')) $$

this is

$$ \frac{\partial \log\left(p(\vec x)\right)}{\partial \theta} = - \frac{\partial F(\vec x)}{\partial \theta} - \frac{1}{Z}\frac{\partial \left(\sum_{\vec x'} \exp\left(-F(\vec x')\right)\right)}{\partial \theta} = - \frac{\partial F(\vec x)}{\partial \theta} - \sum_{\vec x'}\frac{\exp\left(-F(\vec x') \right)}{Z} \frac{\partial \left( -F(\vec x')\right)}{\partial \theta} = - \frac{\partial F(\vec x)}{\partial \theta} + \sum_{\vec x'} p(\vec x') \frac{\partial F(\vec x')}{\partial \theta} $$

So the expectation of the gradient over all training examples $\mathcal{D}$ can be expressed as

$$ \mathbb{E}_\mathcal{D} \left(\frac{\partial \log\left(p(\vec x)\right)}{\partial \theta} \right) = - \mathbb{E}_\mathcal{D} \left( \frac{\partial F(\vec x)}{\partial \theta} \right) + \mathbb{E}_\mathcal{S} \left( \frac{\partial F(\vec x')}{\partial \theta} \right) $$

$\mathbb{E}_\mathcal{S}$ is the expectation under the model distribution. For our first naive implementation that's exactly the weighted sum over all possible visible states, as in the formula above.

Note on the expectation of the gradient:

  • The first term is data dependent. The free energy of our training examples should be lowered.
  • The second term is independent of the training data. The averaged free energy of all possible states should be pushed up.

Until now we have expressed the partial derivatives of the log-probabilities in terms of partial derivatives of the free energy.

Recap the functional form of the free energy

$$ F(\vec x) = - \vec{b}^T \cdot \vec x - \sum_i \log \left(1 + \exp(c_i + \vec W_i \vec x ) \right) $$

and the logistic function (sigmoid function) $$ \sigma(z) = \frac{1}{1+\exp(-z)} $$

with this in mind we get for the derivatives with respect to the parameters of the RBM:

$$ \frac{\partial F(\vec x)}{\partial b_j} = - x_j $$

$$ \frac{\partial F(\vec x)}{\partial c_i} = - \frac{\exp(c_i + \vec W_i \vec x )}{1 + \exp(c_i + \vec W_i \vec x ) } = - \sigma (c_i + \vec W_i \vec x ) $$

$$ \frac{\partial F(\vec x)}{\partial W_{ij}} = - \frac{\exp(c_i + \vec W_i \vec x ) \, x_j}{1 + \exp(c_i + \vec W_i \vec x ) } = - \sigma (c_i + \vec W_i \vec x ) \, x_j $$
In [11]:
def sigmoid(x):
  return 1. / (1. + np.exp(-x))

def grad(X_, param):
  """Gradients of the free energy F, one slice per example (row of X_)."""
  (W, b, c) = param
  s = sigmoid(c + np.dot(X_, W.T))   # s[k, i] = p(y_i = 1 | x_k)
  grad_c = - s
  grad_b = - X_
  # one outer product -s_k (x_k)^T per example, without a python loop
  grad_W = - np.einsum('ki,kj->kij', s, X_)
  return grad_W, grad_b, grad_c
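
A minimal finite-difference sanity check for the analytic derivatives (a sketch, addressing the TODO in the update code below; free_energy is the helper sketched above, and the parameter values are arbitrary):

def num_grad_b(x, param, eps=1e-5):
  # central differences of F(x) with respect to the visible biases b
  (W, b, c) = param
  g = np.zeros_like(b)
  for j in range(len(b)):
    bp = b.copy(); bp[j] += eps
    bm = b.copy(); bm[j] -= eps
    g[j] = (free_energy(x[None, :], (W, bp, c))[0]
            - free_energy(x[None, :], (W, bm, c))[0]) / (2. * eps)
  return g

W0 = np.random.randn(Y_DIM, X_DIM)
b0 = np.random.randn(X_DIM)
c0 = np.random.randn(Y_DIM)
x0 = X[0].astype(float)
_, gb, _ = grad(x0[None, :], (W0, b0, c0))
assert np.allclose(num_grad_b(x0, (W0, b0, c0)), gb[0], atol=1e-4)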
    

Now we put it all together for our (naive) gradient descent implementation:

In [12]:
def compute_new_params(param, X, XY_ALL=XY_ALL, X_COM=X_COM):
  (W, b, c) = param
  Z = partition_function(XY_ALL, param)

  # positive phase: expectation of the gradients under the data distribution
  grad_W, grad_b, grad_c = grad(X, param)
  expectation_grad_b = grad_b.mean(axis=0)
  expectation_grad_c = grad_c.mean(axis=0)
  expectation_grad_W = grad_W.mean(axis=0)

  # negative phase: expectation under the model distribution,
  # i.e. weighted by p(x') over all possible visible states
  grad_W_all, grad_b_all, grad_c_all = grad(X_COM, param)
  p_X_all = p_v(X_COM, param, Z)
  expectation_grad_b_all = (grad_b_all.T * p_X_all).T.sum(axis=0)
  expectation_grad_c_all = (grad_c_all.T * p_X_all).T.sum(axis=0)
  expectation_grad_W_all = (grad_W_all.T * p_X_all).T.sum(axis=0)

  # update rules (alpha is the global learning rate defined below)
  W = W + alpha * (expectation_grad_W_all - expectation_grad_W)
  b = b + alpha * (expectation_grad_b_all - expectation_grad_b)
  c = c + alpha * (expectation_grad_c_all - expectation_grad_c)

  # TODO: test the gradient computation with numerical derivatives
  # (see the finite-difference sketch above)

  return (W, b, c)

# TODO: at the moment Z is computed twice per iteration: in compute_new_params and in the cost function!
def gradient_descent(param, X, num_iters, XY_ALL=XY_ALL, X_COM=X_COM, Y_COM=Y_COM):

  # for plotting the progress
  cost_history = np.zeros(num_iters+1)
  cost = negative_log_likelihood(X, XY_ALL=XY_ALL, Y_COM=Y_COM)

  for i in range(num_iters):
    cost_history[i] = cost(param)
    param = compute_new_params(param, X, XY_ALL=XY_ALL, X_COM=X_COM)
  cost_history[num_iters] = cost(param)

  return param, cost_history
In [13]:
#### Example

# random initialization of the parameters
W = 5. * np.random.random((Y_DIM, X_DIM))
b = 5. * np.random.random(X_DIM)
c = 5. * np.random.random(Y_DIM)

param = (W, b, c)

alpha = .5
num_iters = 4000


param, history_cost = gradient_descent(param, X, num_iters, XY_ALL=XY_ALL, X_COM=X_COM, Y_COM=Y_COM)

(W, b, c) = param

print "Learned parameters: \n" 
print "W: \n", W
print "\nb: \n", b
print "\nc: \n", c, "\n"

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(range(num_iters+1), history_cost)
plt.xlabel("Iterations")
plt.ylabel("Cost")
Learned parameters: 

W: 
[[ -3.2211962    3.65792553   4.77703052]
 [ 11.98323635  -0.2662957  -12.19952062]]

b: 
[-2.74029305  5.40227612  1.29613873]

c: 
[ 2.25874645  0.2530737 ] 

Out[13]:
<matplotlib.text.Text at 0x117ffc5d0>

In an RBM the cost function is not convex; the error function has many local minima. Therefore the solution yielded by gradient descent depends on the initial values of $\theta$. Try multiple runs to see differences in the solutions (a small sketch follows directly below).
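
A small sketch of such an experiment (seeds and iteration count are arbitrary): several random restarts typically end in different local minima with different final costs.

final_costs = []
for seed in range(3):
  np.random.seed(seed)
  p0 = (5. * np.random.random((Y_DIM, X_DIM)),
        5. * np.random.random(X_DIM),
        5. * np.random.random(Y_DIM))
  _, hist = gradient_descent(p0, X, 1000, XY_ALL=XY_ALL, X_COM=X_COM, Y_COM=Y_COM)
  final_costs.append(hist[-1])
print(final_costs)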

In [14]:
cost = negative_log_likelihood(X, XY_ALL=XY_ALL, Y_COM=Y_COM)
print("negative log-likelihood: ", cost(param), '\n')


# is the permutation symmetry of the hidden units fulfilled?
(W, b, c) = param
c_ = np.array(c[::-1])
W_ = np.array(W[::-1])
assert np.isclose(cost((W_, b, c_)), cost(param))



cost_ = negative_log_likelihood_(X, XY_ALL=XY_ALL, Y_COM=Y_COM)
assert np.isclose(cost(param), cost_(param))

hidden = sigmoid(c + np.dot(X, W.T))
print("probability of hidden states to be on for training data: \n", hidden)

#visible = sigmoid(b + np.dot(hidden, W))
#print(visible)
negative log-likelihood:  1.39611501902 

probability of hidden states to be on for training data: 
[[  9.36759152e-01   9.99993669e-01]
 [  9.99977313e-01   4.96673642e-06]]

The minimal cost with two training examples should be $-2 \log(0.5) \approx 1.3863$: this is reached if both examples are equally probable ($p = 0.5$) and all visible states not in the training data have probability zero.
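
For reference:

print(-2. * np.log(0.5))  # 1.3862943611198906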

Further steps

For real-world data this method doesn't work: we can't enumerate all possible states, so learning the parameters directly is not possible. There are special methods to overcome this problem, e.g. Contrastive Divergence.

Even with given parameters, the computation of $Z$ is in general intractable, so computing $p(\vec x)$ is only possible with appropriate approximations.

Read Hinton, Geoffrey: "A Practical Guide to Training Restricted Boltzmann Machines." Momentum 9.1 (2010): 926, to get familiar with more sophisticated methods.

For a tutorial implementation of a RBMs with Theano, see the RBM Deep Learning Tutorial.

Literature

  • Y. Bengio, Learning Deep Architectures for AI (Chapter 5), Found. Trends Mach. Learn., Jan 2009