Univariate Linear Regression¶

Learning Objectives¶

(univariate) linear regression
concepts of machine learning
- model
- learning with a training set
- cost function
- gradient descent

Problem Formulation¶

supervised learning (linear regression)¶

$m$-observations: {$ x^{(i)}$} with target values {$y^{(i)} \in \mathbb R$}

Goal: prediction of $y$ for a new $x$.
Linear Model
- Linear equation (equation of a straight line): $$ h_\Theta (x) = \theta_0 + \theta_1 x $$

Note: There are two parameters: $\theta_0,\theta_1$ (parametric model)

In [1]:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:

###blood haemoglobin (Hb) levels and packed cell volumes (PCV) of 14 female blood bank donors
blood = np.array([
  [15.5,0.450],
  [13.6,0.420],
  [13.5,0.440],
  [13.0,0.395],
  [13.3,0.395],
  [12.4,0.370],
  [11.1,0.390],
  [13.1,0.400],
  [16.1,0.445],
  [16.4,0.470],
  [13.4,0.390],
  [13.2,0.400],
  [14.3,0.420],
  [16.1,0.450]])

x = blood[:,0]
y = blood[:,1] 

from numpy import arange, array, ones, linalg
A = array([ x, ones(len(x))])

theta = linalg.lstsq(A.T,y)[0] # obtaining the parameters

line = theta[0]*x+theta[1] # regression line
#plt.plot(x,line,'r-',x,y,'o')

Example¶

In [3]:

plt.scatter(x,y, label="Experimental Values")
plt.xlabel("blood haemoglobin(Hb) levels / g/dL")
plt.ylabel("packed cell volumes (PCV) / ?")
plt.legend()
plt.show()

Idea¶

Find a line $h_{\Theta}(x)$ that goes "as near as possible" through the data.

So we are looking for values of the parameters $\Theta = \{\theta_0, \theta_1\}$

A quantification for "as near as possible" is given later. Any ideas?

In [4]:

plt.scatter(x,y, label="Experimental Values")
plt.plot(x,line,'r-', label='Fit with $\Theta_0 = $' + str(round(theta[1], 3)) + 
         "  $\Theta_1 = $" + str(round(theta[0],4)) ) 
plt.xlabel("blood haemoglobin(Hb) levels / g/dL")
plt.ylabel("packed cell volumes (PCV)")
plt.legend()
plt.show()

Training set¶

$m$: number of training examples
$x$: input variable
$y$: output variable
$(x^{(i)}, y^{(i)})$: $i$-th training example

$$ \mathcal D_{train} = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\} $$

Example data set: Hg-PCV¶

haemoglobin level ($x$)	packed cell volume ($y$)
15.5	0.450
13.6	0.420
13.5	0.440
13.0	0.395
$\dots$	$\dots$

Overview: Training Procedure¶

Model: $h_\Theta (x) = \theta_0 + \theta_1 x$
- $h_\Theta(x)$ : hypothesis
- two parameters
inference: estimating the model parameters $\bf \theta$ by learning from data

In [5]:

from IPython.display import Image
Image("../univariateLinearRegression/pics/trainings-procedure.png")

Out[5]:

Lineare regression with one variable (univariate linear regression)¶

Why the name ``Linear regression with one variable''?

One variable: $x$
Hypothesis - Model $h_\theta (x) = \theta_0 + \theta_1 x $
Hypothesis is linear with respect to the parameters $\theta_0, \theta_1$.
Prediction of a floating point number: Regression
(Hypothesis is linear with respect to the variable $x$ - not the reason for linear regression)

Note:

Linear regression can be solved algebraically (with a pseudo inverse). This is not considered here.

Cost function¶

Remember we are looking for a strait line that goes "as near as possible" through the data.

We give each possible strait line a value which quantifies the quality.

The lower the value the better the "fit".
That's the cost function (or error function).

(squared error) cost function :¶

$$ J_{\mathcal D}(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2 $$

Typical cost function for regression (can be derived from the "Maximum Likelihood Principle".)

Goal: Minimization of the cost¶

$$ \text{minimize}_{\theta} J(\theta) $$

Note:

Hypothesis $h_{\Theta}(x)$ is a function of $x$ with fixed parameters $\Theta$.
Cost function $J_{\mathcal D}(\Theta)$ is a function of $\Theta$ (for a fixed training data set $\mathcal D$)

We can reflect this fact in the implementation:

In [7]:

def get_linear_hypothesis(theta_0, theta_1):
    return lambda x: theta_0 + theta_1 * x

def get_squared_error_cost_function(x, y, get_hypothesis):
    assert(len(x)==len(y))
    m = len(x)
    return lambda theta_0, theta_1: 1. / (2. * m) * ((get_hypothesis(theta_0, theta_1)(x) - y)**2).sum()

Example:¶

no intersect ($\theta_0$), i.e. hypothesis is $ h_{\theta_1}(x) = \theta_1 \cdot x $

training data: $\mathcal D = \{ (1, .5), (2, 3), (3, 4) \}$

In [8]:

x = np.array([0.,1., 3.])
y = np.array([0, 2., 4.])


costs = np.zeros([len(np.arange(0.,3.,0.1))])
thetas_1 = np.arange(0.,3.,0.1)
cost_function = get_squared_error_cost_function(x, y, get_linear_hypothesis)
for i, theta_1 in enumerate(thetas_1):
    costs[i] = cost_function(0., theta_1)

In [9]:

f, axarr = plt.subplots(1, 2, figsize=(8,4))
axarr[0].scatter(x,y); axarr[0].set_title('Data'); axarr[0].set_xlabel("x"); axarr[0].set_ylabel("y")
axarr[1].plot(thetas_1, costs); axarr[1].set_title('Cost');axarr[1].set_xlabel("$\Theta_1$"); axarr[1].set_ylabel("cost")

Out[9]:

<matplotlib.text.Text at 0x110a9c748>

In [10]:

x = y = np.arange(-1.0, 1.0, 0.1)
y = 3. + 2. * x + np.random.randn(len(y)) * 0.2

thetas = [[2., 1.5, 'c'], [2.5,3.,'k'], [3., 2., 'g']]


#computation of the cost-grid                                                                                              
w0s  = np.arange(1.5,4.5,0.02)
w1s = np.arange(.5,3.5,0.02)
cost = np.zeros([len(w0s),len(w1s)])
cost_function = get_squared_error_cost_function(x, y, get_linear_hypothesis)
for i, theta_0 in enumerate(w0s):
    for j, theta_1 in enumerate(w1s):
        cost[i][j] = cost_function(theta_0, theta_1)

#contour-plot                                                                                                              
X, Y = np.meshgrid(w0s, w1s)

In [11]:

plt.subplot(121);plt.plot(x,y,'+r');plt.xlabel('x');plt.ylabel('y');plt.title('Data and Hypotheses')
for t in thetas:
    h = get_linear_hypothesis(t[0], t[1])(x)
    plt.plot(x,h,'-'+t[2])
plt.subplot(122);plt.contour(X, Y, cost);plt.xlabel('$\Theta_0$');plt.ylabel('$\Theta_1$');plt.title('Contour Plot')
for t in thetas:
    plt.plot(t[0],t[1],'o'+ t[2])
plt.tight_layout()

In [12]:

#3d-plot                                                                                                                   
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.cm as cm
vmin = cost.min()
vmax = cost.max()

In [13]:

fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(122, projection='3d')
ax.plot_surface(X, Y, cost, cmap=cm.jet, rstride=5, cstride=5, vmax=vmax, vmin = vmin, antialiased=True )
ax.set_xlabel('$\Theta_0$');ax.set_ylabel('$\Theta_1$');ax.set_zlabel('cost');ax.set_title('3D-Plot')

Out[13]:

<matplotlib.text.Text at 0x110eb6128>

Note:

The cost function for linear regression is convex, i.e. only one (global) minimum.

Cost function summary¶

Cost is a function of the parameters.
The goal is minimization of the cost for finding "good" parameters.
Concept of cost function also for other kinds of models, e.g. neuronal networks or k-means clustering.

Gradient descent¶

Recap: Gradient¶

The gradient of a scalar function (e.g. cost function) is:

$$ \begin{align*} \vec \nabla J(\theta_0, \theta_1) = \begin{pmatrix} \frac{\partial }{\partial \theta_0} \\ \frac{\partial }{\partial \theta_1} \end{pmatrix} J(\theta_0, \theta_1) = \begin{pmatrix} \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} \\ \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} \end{pmatrix} \end{align*} $$

mit dem Nabla-Operator $$ \begin{align*} \vec \nabla = \begin{pmatrix} \frac{\partial }{\partial \theta_0} \\ \frac{\partial }{\partial \theta_1} \end{pmatrix} \end{align*} $$

Recap: Problem formulation

Hypothesis: $h_{\Theta}(x) = \theta_0 +\theta_1 \cdot x$
Parameters: $\Theta = \{\theta_0 , \theta_1\}$
Cost function: $J(\Theta) = \frac{1}{2m} \sum_{i=1}^m (h_\Theta(x^{(i)})-y^{(i)})^2$
Goal: $\text{minimize}_{\Theta} J(\Theta)$

Gradient Descent Idea¶

Goal: $\text{minimize}_{\Theta} J(\Theta)$

Start with special values for $\Theta$ (start values); for the univariate linear regression: $\Theta = \{\theta_0, \theta_1\}$
In each iteration:
- Change the values of $\Theta$, such that $J(\Theta)$ becomes smaller.
- Use the gradient for finding new values in each iteration.

Gradient descent procedure¶

Start with special values for $\Theta$ (start values); for the univariate linear regression: $\Theta = \{\theta_0, \theta_1\}$
Repeat (until a sufficent good minimum was found - stopping criterion):
- Change the values of $\Theta$ by using the partial derivatives (gradient) with the following update rule: $$ \theta_j^{neu} \leftarrow \theta_j^{alt} - \alpha \frac{\partial}{\partial \theta_j} J(\Theta^{alt}) $$ with the hyper parameter $\alpha>0$ (learning rate)

Note for the implementation: simultaneous update of all parameters

\begin{align*} temp0 &\leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \\ temp1 &\leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \\ \theta_0 &\leftarrow temp0 \\ \theta_1 &\leftarrow temp1 \\ \end{align*}

Partial derivative for $\theta_0$¶

$$ \theta_0 \leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\Theta) $$\begin{align*} \frac{\partial}{\partial \theta_0} J(\Theta) &= \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m (h_\Theta(x^{(i)})-y^{(i)})^2 \\ & = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) ^2 \\ & = \frac{1}{m} \sum_{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) \end{align*}

Partial derivative for $\theta_1$¶

$$ \theta_1 \leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\Theta) $$

\begin{align} \frac{\partial}{\partial \theta1} J(\Theta) &= \frac{\partial}{\partial \theta_1} \frac{1}{2m} \sum{i=1}^m (h\Theta(x^{(i)})-y^{(i)})^2 \ & = \frac{\partial}{\partial \theta_1} \frac{1}{2m} \sum{i=1}^m (\theta0 + \theta_1 \cdot x^{(i)} - y^{(i)}) ^2 \ & = \frac{1}{m} \sum{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) \cdot x^{(i)} \end{align}

Step size¶

step size depends on two factors:

size of the gradient $\vec \nabla J(\Theta)$
learning rate $\alpha > 0$ (hyper parameter)

$\alpha$ must be choosen carefully.

Exercise: Implementation of gradient descent for the linear regression¶

with python and numpy

Use the cost function and the linear hypothesis (see above).

Write a python function for the update of the new $\Theta$ values:

theta_0, theta_1 = compute_new_theta(x, y, theta_0, theta_1, alpha)

Write a python function that given as input start values for $\theta_0$ and $\theta_1$ iteratively applies the the `compute_new_theta' function to find "good" $\Theta$ values (in a python function).
- Use the synthetic train data (see below).
- Plot the progress (cost over iterations) of the training procedure.
Plot the best hypothesis (strait line) together with the data.
Try different values for the hyper parameter $\alpha$ and plot for all the progress (cost value over iterations) in a graph.

In [14]:

# generation of synthetic train data
x_min = -10.
x_max = 10.
m = 10

x = np.random.uniform(x_min, x_max, m)
a = 10.
b = 5.
y_noise_sigma = 4.
y = a + b * x + np.random.randn(m) * y_noise_sigma

plt.plot(x, y, "bo")
plt.xlabel("x")
plt.ylabel("y")

Out[14]:

<matplotlib.text.Text at 0x110dd7d68>

Literature:¶

Andrew Ng: Machine Learning. Openclassroom Stanford University, 2013
C. Bishop: Pattern recognition and Machine Learning, Springer Verlag 2006