## Gaussian Processes¶

Code and explanation mostly from Nando de Freitas, http://www.cs.ubc.ca/~nando/540-2013/lectures.html

In [29]:
import numpy as np
import matplotlib.pyplot as pl
%matplotlib inline


#### Assumption¶

• Data is smooth, i.e. if the $\vec x$ are similar then the $y$-values are similar.

"Smoothness Prior"

#### Noisless GP Regression¶

$$\mathcal D_{train} = \{ (\vec x^{(1)}, f^{(1)}), \dots, (\vec x^{(m)}, f^{(m)})\}$$

with

• $n$-dimensional vectors $\vec x^{(j)}$
• $f$ is a function of $\vec x$: $f^{(m)} = f^{(m)}(x^{(m)})$

Goal:

Given a test set $X_*$ of size $m_* \times n$ we want to predict the (function) outputs $f_*$:

$$\left(\begin{array}{c} \vec f \\ \vec f_* \end{array} \right) = \mathcal N \left(\left( \begin{array}{c} \vec \mu \\ \vec \mu_* \end{array} \right) , \left(\begin{array}{cc} K & K_* \\ K_*^T & K_{**} \end{array} \right) \right)$$

$$\vec f^T = (f^{(1)}, f^{(2)}, \dots, f^{(n)})$$ and analog $\vec f_*^T$, $\vec \mu$ and $\vec \mu_*$

#### Kernel¶

with

• $K = \kappa(X, X)$ is $m \times m$
• $K_* = \kappa(X, X_*)$ is $m \times m_*$
• $K_{**} = \kappa(X_*, X_*)$ is $m_* \times m_*$

and the kernel $\kappa$, e.g.
$$\kappa (x, x') = \sigma_f^2 \exp \left( - \frac{(x-x')^2}{2 l^2}\right)$$

• $l$: hyper parameter
• $\sigma_f^2$: ?
In [2]:
def kernel(a, b, kernelParameter = 0.1, sigma_square = 1.):
""" GP squared exponential kernel

\kappa (x, x') = \sigma_f^2 \exp \left( - \frac{(x-x')^2}{2 l^2}\right)
Parameters
----------
a : numpy array (shape: m x n)
Design matrix:  row index == data index; column index == feature index

b : numpy array (shape: m_* x n)
Design matrix:  row index == data index; column index == feature index

kernelParameter: float
kernelParameter = l^2
sigma_square: float
\sigma_f^2

"""
# efficient implementation in numpy:
# squared euclidian distance between all row vectors of a and b.
sqdist = np.sum(a**2,1).reshape(-1,1) + np.sum(b**2,1) - 2*np.dot(a, b.T)

return sigma_square * np.exp(-.5 * (1/kernelParameter) * sqdist)

In [3]:
X = np.array([[1,2,3],[2,2,3],[3,2,3]])
X_ = np.array([[4,2,3],[2,2,4]])
kernel(X, X_, 2)

Out[3]:
array([[ 0.10539922,  0.60653066],
[ 0.36787944,  0.77880078],
[ 0.77880078,  0.60653066]])

#### Multivariate Bayesian Theorem¶

Let $x_1$ and $x_2$ be random vectors distributed joinly with

$$p(\vec x_2 \mid \vec x_1) = \mathcal N \left( \vec x_2 \mid \vec \mu_{2 \mid 1}, \Sigma_{2 \mid 1} \right)$$

$$\mu_{2 \mid 1} = \mu_2 + \Sigma_{2 \mid 1} \Sigma_{11}^{-1}(\vec x_1 - \mu_1)$$

$$\Sigma_{2 \mid 1} = \Sigma_{22} - \Sigma_{}$$

$$p(\vec f_* \mid X_*, X, \vec f) = \mathcal N(\vec f_* \mid \vec \mu_*, \Sigma_*)$$

with (by the Multivariate Gaussian Theorem, see K.Murphy ML, Theorem 4.2.1): $$\vec \mu_* = \vec \mu (X_*) + K_*^T K^{-1} ( \vec f - \vec \mu (X) )$$ $$\Sigma_{*} = K_{**} - K_*^T K^{-1} K_*$$

What is $\vec \mu (X_*)$ resp. $\vec \mu (X)$ ? These are the means of the priors.

#### Noisy GP Regression¶

$$y = f(\vec x) + \epsilon$$ with

• $\epsilon = \mathcal N(0, \sigma_y^2)$

$$p(\vec y \mid X) = \int p(\vec y \mid f, X) p(\vec f \mid X) d\vec f$$

with zero-mean prior: $$p(\vec f \mid X) = \mathcal N (f \mid 0, K)$$

the noise of each data point is independent: $$p(y \mid f) = \prod_{j=1}^m \mathcal N(f^{(i)}, \sigma_y^2)$$

with the multivariate Gaussian theorem:

$$\text{cov} \left[ \vec y \mid X \right] = K + \sigma_y^2 {\bf I}_n =: K_y$$ so we just need to modify the kernel: $K \rightarrow K_y$

$$\left( \begin{array}{c} \vec f \\ \vec f_* \end{array} \right) = \mathcal N \left( \left( \begin{array}{c} \vec \mu \\ \vec \mu_* \end{array} \right) , \left( \begin{array}{cc} K_y & K_* \\ K_*^T & K_{**} \end{array} \right) \right)$$

$$p(\vec f_* \mid X_*, X, \vec f) = \mathcal N(\vec f_* \mid \vec \mu_*, \Sigma_*)$$

$$\vec \mu_* = \vec \mu (X_*) + K_*^T K_y^{-1} ( \vec y - \vec \mu (X) )$$ $$\Sigma_{*} = K_{**} - K_*^T K_y^{-1} K_*$$

Prior for the noise (inverse wishard distribution in 1-D) i.e. inverse Gamma distribution, see

https://en.wikipedia.org/wiki/Inverse-gamma_distribution

In [4]:
# This is the true unknown function we are trying to approximate
f = lambda x: np.sin(0.9*x).flatten()
#f = lambda x: (0.25*(x**2)).flatten()

In [5]:
N = 5        # number of training points.
n = 100         # number of test points.
s = 0.00005    # noise variance.

# Sample some input points and noisy versions of the function evaluated at
# these points.
X = np.random.uniform(-5, 5, size=(N,1))
y = f(X) + s*np.random.randn(N)

In [6]:
# this is also a parameter, we must estimate from the data!!! TODO
# e.g. by max. likelihood, etc.
s_y = 0.001

In [7]:
var_y = np.eye(len(X)) * s_y

In [8]:
# points we're going to make predictions at.
Xtest = np.linspace(-5, 5, n).reshape(-1,1)

In [9]:
kernelParameter = 2.

In [10]:
K = kernel(X, X, kernelParameter = kernelParameter)
K_y = K + var_y
k_ = kernel(X, Xtest, kernelParameter = kernelParameter)

# naive numerically unstable version
K_inv = np.linalg.inv(K_y)
alpha = np.dot(K_inv, y)
mu = np.dot(k_.T, alpha)

In [11]:
K_ = kernel(Xtest, Xtest, kernelParameter = kernelParameter)
s_2 = K_ - np.dot( k_.T, np.dot(K_inv, k_) )
s_ = np.sqrt(np.diag(s_2))

In [12]:
pl.figure(1)
pl.clf()
pl.plot(X, y, 'r+', ms=20)
pl.plot(Xtest, f(Xtest), 'b-')
pl.gca().fill_between(Xtest.flat, mu-3*s_, mu+3*s_, color="#dddddd")
pl.plot(Xtest, mu, 'r--', lw=2)
pl.savefig('predictive.png', bbox_inches='tight')
pl.title('Mean predictions plus 3 st.deviations')
pl.axis([-5, 5, -3, 3])

Out[12]:
[-5, 5, -3, 3]

#### Faster and numerically more stable implementation¶

In the implementation there is a matrix inversion of $K_y$. The matrix $K_y$ is symmetric and positive definite.

The inversion of such a matrix can be done more stable and faster by using the Cholesky decomposition.

A symmetric (more general Hermitian) positive definite matrix $K$ can be factorized $$K = L L^T$$ with

• $L$ is a lower triangular matrix with real and positive diagonal entries
• and $L^T$ is the transpose of $L$.
In [13]:
# same code snipped as above
K = kernel(X, X, kernelParameter = kernelParameter)
K_y = K + var_y

In [14]:
L = np.linalg.cholesky(K_y)
L

Out[14]:
array([[ 1.00049988,  0.        ,  0.        ,  0.        ,  0.        ],
[ 0.09580453,  0.99590235,  0.        ,  0.        ,  0.        ],
[ 0.13339719,  0.97872192,  0.15908673,  0.        ,  0.        ],
[ 0.20207284,  0.91560838,  0.33473585,  0.09889277,  0.        ],
[ 0.14997737, -0.01421492, -0.03578134, -0.044307  ,  0.98745193]])
In [15]:
np.allclose(L, np.tril(L)) # check if lower triangular

Out[15]:
True

It is much easier to compute the inverse of a triangular matrix and there exist numerical solutions, see http://www.mcs.csueastbay.edu/~malek/TeX/Triangle.pdf

So we use this to compute $K_y^{-1}$:

$$K_{y}^{-1} = \left(L L^T \right)^{-1} = \left(L^T\right)^{-1} L^{-1} = L^{-T} L^{-1}$$

Note: $\left( A B\right)^{-1} = B^{-1} A^{-1}$

$$\vec \mu_* = \vec \mu (X_*) + K_*^T K^{-1} ( \vec y - \vec \mu (X) )$$ with prior $\mu (.) = 0$ $$\vec \mu_* = K_*^T K_y^{-1} \vec f = K_*^T L^{-T} L^{-1} \vec f = \left( (L^{-1} K_*)^T \right) \left( L^{-1} \vec f \right)$$

For solving $\left( (L^{-1} K_*)^T \right)$ and $\left( L^{-1} \vec f \right)$ we can use the fact that $L$ is a lower triangular matrix.

With scipy by scipy.linalg.solve_triangular.

In [16]:
import scipy.linalg
help(scipy.linalg.solve_triangular)

Help on function solve_triangular in module scipy.linalg.basic:

solve_triangular(a, b, trans=0, lower=False, unit_diagonal=False, overwrite_b=False, debug=None, check_finite=True)
Solve the equation a x = b for x, assuming a is a triangular matrix.

Parameters
----------
a : (M, M) array_like
A triangular matrix
b : (M,) or (M, N) array_like
Right-hand side matrix in a x = b
lower : bool, optional
Use only data contained in the lower triangle of a.
Default is to use upper triangle.
trans : {0, 1, 2, 'N', 'T', 'C'}, optional
Type of system to solve:

========  =========
trans     system
========  =========
0 or 'N'  a x  = b
1 or 'T'  a^T x = b
2 or 'C'  a^H x = b
========  =========
unit_diagonal : bool, optional
If True, diagonal elements of a are assumed to be 1 and
will not be referenced.
overwrite_b : bool, optional
Allow overwriting data in b (may enhance performance)
check_finite : bool, optional
Whether to check that the input matrices contain only finite numbers.
Disabling may give a performance gain, but may result in problems
(crashes, non-termination) if the inputs do contain infinities or NaNs.

Returns
-------
x : (M,) or (M, N) ndarray
Solution to the system a x = b.  Shape of return matches b.

Raises
------
LinAlgError
If a is singular

Notes
-----


In [17]:
k_ = kernel(X, Xtest, kernelParameter = kernelParameter)

In [18]:
Lk = scipy.linalg.solve_triangular(L, k_, lower=True)
Ly = scipy.linalg.solve_triangular(L, y, lower=True)
mu = np.dot(Lk.T, Ly)


analog $$\Sigma_{*} = K_{**} - K_*^T K_y^{-1} K_* = K_{**} - K_*^T L^{-T}L^{-1} K_* = K_{**} - \left(L^{-1} K_*\right)^T \left(L^{-1} K_* \right)$$

In [21]:
K_ = kernel(Xtest, Xtest, kernelParameter = kernelParameter)

# we need only the diagonal elements of $\Sigma_*$:
s2 = np.diag(K_) - np.sum(Lk**2, axis=0)
s_ = np.sqrt(s2)

In [23]:
# PLOTS:
pl.figure(1)
pl.clf()
pl.plot(X, y, 'r+', ms=20)
pl.plot(Xtest, f(Xtest), 'b-')
pl.gca().fill_between(Xtest.flat, mu-3*s_, mu+3*s_, color="#dddddd")
pl.plot(Xtest, mu, 'r--', lw=2)
pl.savefig('predictive.png', bbox_inches='tight')
pl.title('Mean predictions plus 3 st.deviations')
pl.axis([-5, 5, -3, 3])

Out[23]:
[-5, 5, -3, 3]
In [30]:
##TODO explaination

# draw samples from the prior at our test points.
L = np.linalg.cholesky(K_  + s * np.eye(n) )
f_prior = np.dot(L, np.random.normal(size=(n,10)))
pl.figure(2)
pl.clf()
pl.plot(Xtest, f_prior)
pl.title('Ten samples from the GP prior')
pl.axis([-5, 5, -3, 3])
pl.savefig('prior.png', bbox_inches='tight')

# draw samples from the posterior at our test points.
L = np.linalg.cholesky(K_ + s * np.eye(n) - np.dot(Lk.T, Lk))
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,10)))
pl.figure(3)
pl.clf()
pl.plot(Xtest, f_post)
pl.title('Ten samples from the GP posterior')
pl.axis([-5, 5, -3, 3])
pl.savefig('post.png', bbox_inches='tight')


TODO: How to determine the hyperparameters s_y, kernel_parameter etc.

Python Library:

### Literature:¶

#### Gaussian Processes¶

• Lectures of Nando de Freitas: Machine Lerning: Gaussian processes for nonlinear regression, http://www.cs.ubc.ca/~nando/540-2013/lectures.html
• Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
• Rasmussen, C. E., & Williams, C. K. (2006). Gaussian processes for machine learning (Vol. 1). Cambridge: MIT press.