## Univariate Linear Regression¶

#### Learning Objectives¶

• (univariate) linear regression
• concepts of machine learning
• model
• learning with a training set
• cost function

### Problem Formulation¶

#### supervised learning (linear regression)¶

$m$-observations: {$x^{(i)}$} with target values {$y^{(i)} \in \mathbb R$}

• Goal: prediction of $y$ for a new $x$.

• Linear Model

• Linear equation (equation of a straight line): $$h_\Theta (x) = \theta_0 + \theta_1 x$$

Note: There are two parameters: $\theta_0,\theta_1$ (parametric model)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
###blood haemoglobin (Hb) levels and packed cell volumes (PCV) of 14 female blood bank donors
blood = np.array([
[15.5,0.450],
[13.6,0.420],
[13.5,0.440],
[13.0,0.395],
[13.3,0.395],
[12.4,0.370],
[11.1,0.390],
[13.1,0.400],
[16.1,0.445],
[16.4,0.470],
[13.4,0.390],
[13.2,0.400],
[14.3,0.420],
[16.1,0.450]])

x = blood[:,0]
y = blood[:,1]

from numpy import arange, array, ones, linalg
A = array([ x, ones(len(x))])

theta = linalg.lstsq(A.T,y)[0] # obtaining the parameters

line = theta[0]*x+theta[1] # regression line
#plt.plot(x,line,'r-',x,y,'o')


#### Example¶

In [3]:
plt.scatter(x,y, label="Experimental Values")
plt.xlabel("blood haemoglobin(Hb) levels / g/dL")
plt.ylabel("packed cell volumes (PCV) / ?")
plt.legend()
plt.show()


#### Idea¶

Find a line $h_{\Theta}(x)$ that goes "as near as possible" through the data.

So we are looking for values of the parameters $\Theta = \{\theta_0, \theta_1\}$

A quantification for "as near as possible" is given later. Any ideas?

In [4]:
plt.scatter(x,y, label="Experimental Values")
plt.plot(x,line,'r-', label='Fit with $\Theta_0 =$' + str(round(theta[1], 3)) +
"  $\Theta_1 =$" + str(round(theta[0],4)) )
plt.xlabel("blood haemoglobin(Hb) levels / g/dL")
plt.ylabel("packed cell volumes (PCV)")
plt.legend()
plt.show()


### Training set¶

• $m$: number of training examples
• $x$: input variable
• $y$: output variable
• $(x^{(i)}, y^{(i)})$: $i$-th training example
$$\mathcal D_{train} = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$$

#### Example data set: Hg-PCV¶

haemoglobin level ($x$) packed cell volume ($y$)
15.5 0.450
13.6 0.420
13.5 0.440
13.0 0.395
$\dots$ $\dots$

#### Overview: Training Procedure¶

• Model: $h_\Theta (x) = \theta_0 + \theta_1 x$
• $h_\Theta(x)$ : hypothesis
• two parameters
• inference: estimating the model parameters $\bf \theta$ by learning from data
In [5]:
from IPython.display import Image
Image("../univariateLinearRegression/pics/trainings-procedure.png")

Out[5]:

#### Lineare regression with one variable (univariate linear regression)¶

Why the name Linear regression with one variable''?

• One variable: $x$
• Hypothesis - Model $h_\theta (x) = \theta_0 + \theta_1 x$
• Hypothesis is linear with respect to the parameters $\theta_0, \theta_1$.
• Prediction of a floating point number: Regression

• (Hypothesis is linear with respect to the variable $x$ - not the reason for linear regression)

Note:

Linear regression can be solved algebraically (with a pseudo inverse). This is not considered here.

### Cost function¶

Remember we are looking for a strait line that goes "as near as possible" through the data.

We give each possible strait line a value which quantifies the quality.

• The lower the value the better the "fit".
• That's the cost function (or error function).

#### (squared error) cost function :¶

$$J_{\mathcal D}(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$$

Typical cost function for regression (can be derived from the "Maximum Likelihood Principle".)

#### Goal: Minimization of the cost¶

$$\text{minimize}_{\theta} J(\theta)$$

Note:

• Hypothesis $h_{\Theta}(x)$ is a function of $x$ with fixed parameters $\Theta$.
• Cost function $J_{\mathcal D}(\Theta)$ is a function of $\Theta$ (for a fixed training data set $\mathcal D$)

We can reflect this fact in the implementation:

In [7]:
def get_linear_hypothesis(theta_0, theta_1):
return lambda x: theta_0 + theta_1 * x

def get_squared_error_cost_function(x, y, get_hypothesis):
assert(len(x)==len(y))
m = len(x)
return lambda theta_0, theta_1: 1. / (2. * m) * ((get_hypothesis(theta_0, theta_1)(x) - y)**2).sum()


#### Example:¶

no intersect ($\theta_0$), i.e. hypothesis is $h_{\theta_1}(x) = \theta_1 \cdot x$

training data: $\mathcal D = \{ (1, .5), (2, 3), (3, 4) \}$

In [8]:
x = np.array([0.,1., 3.])
y = np.array([0, 2., 4.])

costs = np.zeros([len(np.arange(0.,3.,0.1))])
thetas_1 = np.arange(0.,3.,0.1)
cost_function = get_squared_error_cost_function(x, y, get_linear_hypothesis)
for i, theta_1 in enumerate(thetas_1):
costs[i] = cost_function(0., theta_1)

In [9]:
f, axarr = plt.subplots(1, 2, figsize=(8,4))
axarr[0].scatter(x,y); axarr[0].set_title('Data'); axarr[0].set_xlabel("x"); axarr[0].set_ylabel("y")
axarr[1].plot(thetas_1, costs); axarr[1].set_title('Cost');axarr[1].set_xlabel("$\Theta_1$"); axarr[1].set_ylabel("cost")

Out[9]:
<matplotlib.text.Text at 0x110a9c748>
In [10]:
x = y = np.arange(-1.0, 1.0, 0.1)
y = 3. + 2. * x + np.random.randn(len(y)) * 0.2

thetas = [[2., 1.5, 'c'], [2.5,3.,'k'], [3., 2., 'g']]

#computation of the cost-grid
w0s  = np.arange(1.5,4.5,0.02)
w1s = np.arange(.5,3.5,0.02)
cost = np.zeros([len(w0s),len(w1s)])
cost_function = get_squared_error_cost_function(x, y, get_linear_hypothesis)
for i, theta_0 in enumerate(w0s):
for j, theta_1 in enumerate(w1s):
cost[i][j] = cost_function(theta_0, theta_1)

#contour-plot
X, Y = np.meshgrid(w0s, w1s)


In [11]:
plt.subplot(121);plt.plot(x,y,'+r');plt.xlabel('x');plt.ylabel('y');plt.title('Data and Hypotheses')
for t in thetas:
h = get_linear_hypothesis(t[0], t[1])(x)
plt.plot(x,h,'-'+t[2])
plt.subplot(122);plt.contour(X, Y, cost);plt.xlabel('$\Theta_0$');plt.ylabel('$\Theta_1$');plt.title('Contour Plot')
for t in thetas:
plt.plot(t[0],t[1],'o'+ t[2])
plt.tight_layout()

In [12]:
#3d-plot
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.cm as cm
vmin = cost.min()
vmax = cost.max()

In [13]:
fig = plt.figure(figsize=(10,6))
ax.plot_surface(X, Y, cost, cmap=cm.jet, rstride=5, cstride=5, vmax=vmax, vmin = vmin, antialiased=True )
ax.set_xlabel('$\Theta_0$');ax.set_ylabel('$\Theta_1$');ax.set_zlabel('cost');ax.set_title('3D-Plot')

Out[13]:
<matplotlib.text.Text at 0x110eb6128>

Note:

The cost function for linear regression is convex, i.e. only one (global) minimum.

#### Cost function summary¶

• Cost is a function of the parameters.
• The goal is minimization of the cost for finding "good" parameters.
• Concept of cost function also for other kinds of models, e.g. neuronal networks or k-means clustering.

The gradient of a scalar function (e.g. cost function) is:

\begin{align*} \vec \nabla J(\theta_0, \theta_1) = \begin{pmatrix} \frac{\partial }{\partial \theta_0} \\ \frac{\partial }{\partial \theta_1} \end{pmatrix} J(\theta_0, \theta_1) = \begin{pmatrix} \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} \\ \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} \end{pmatrix} \end{align*}

mit dem Nabla-Operator \begin{align*} \vec \nabla = \begin{pmatrix} \frac{\partial }{\partial \theta_0} \\ \frac{\partial }{\partial \theta_1} \end{pmatrix} \end{align*}

Recap: Problem formulation

• Hypothesis: $h_{\Theta}(x) = \theta_0 +\theta_1 \cdot x$
• Parameters: $\Theta = \{\theta_0 , \theta_1\}$
• Cost function: $J(\Theta) = \frac{1}{2m} \sum_{i=1}^m (h_\Theta(x^{(i)})-y^{(i)})^2$
• Goal: $\text{minimize}_{\Theta} J(\Theta)$

Goal: $\text{minimize}_{\Theta} J(\Theta)$

1. Start with special values for $\Theta$ (start values); for the univariate linear regression: $\Theta = \{\theta_0, \theta_1\}$
2. In each iteration:
• Change the values of $\Theta$, such that $J(\Theta)$ becomes smaller.
• Use the gradient for finding new values in each iteration.

1. Start with special values for $\Theta$ (start values); for the univariate linear regression: $\Theta = \{\theta_0, \theta_1\}$
2. Repeat (until a sufficent good minimum was found - stopping criterion):
• Change the values of $\Theta$ by using the partial derivatives (gradient) with the following update rule: $$\theta_j^{neu} \leftarrow \theta_j^{alt} - \alpha \frac{\partial}{\partial \theta_j} J(\Theta^{alt})$$ with the hyper parameter $\alpha>0$ (learning rate)

Note for the implementation: simultaneous update of all parameters

\begin{align*} temp0 &\leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \\ temp1 &\leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \\ \theta_0 &\leftarrow temp0 \\ \theta_1 &\leftarrow temp1 \\ \end{align*}

#### Partial derivative for $\theta_0$¶

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\Theta)$$\begin{align*} \frac{\partial}{\partial \theta_0} J(\Theta) &= \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m (h_\Theta(x^{(i)})-y^{(i)})^2 \\ & = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) ^2 \\ & = \frac{1}{m} \sum_{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) \end{align*}

#### Partial derivative for $\theta_1$¶

$$\theta_1 \leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\Theta)$$

\begin{align} \frac{\partial}{\partial \theta1} J(\Theta) &= \frac{\partial}{\partial \theta_1} \frac{1}{2m} \sum{i=1}^m (h\Theta(x^{(i)})-y^{(i)})^2 \ & = \frac{\partial}{\partial \theta_1} \frac{1}{2m} \sum{i=1}^m (\theta0 + \theta_1 \cdot x^{(i)} - y^{(i)}) ^2 \ & = \frac{1}{m} \sum{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) \cdot x^{(i)} \end{align}

#### Step size¶

step size depends on two factors:

• size of the gradient $\vec \nabla J(\Theta)$
• learning rate $\alpha > 0$ (hyper parameter)

$\alpha$ must be choosen carefully.

#### Exercise: Implementation of gradient descent for the linear regression¶

with python and numpy

Use the cost function and the linear hypothesis (see above).

1. Write a python function for the update of the new $\Theta$ values:
theta_0, theta_1 = compute_new_theta(x, y, theta_0, theta_1, alpha)
2. Write a python function that given as input start values for $\theta_0$ and $\theta_1$ iteratively applies the the compute_new_theta' function to find "good" $\Theta$ values (in a python function).

• Use the synthetic train data (see below).
• Plot the progress (cost over iterations) of the training procedure.
3. Plot the best hypothesis (strait line) together with the data.

4. Try different values for the hyper parameter $\alpha$ and plot for all the progress (cost value over iterations) in a graph.

In [14]:
# generation of synthetic train data
x_min = -10.
x_max = 10.
m = 10

x = np.random.uniform(x_min, x_max, m)
a = 10.
b = 5.
y_noise_sigma = 4.
y = a + b * x + np.random.randn(m) * y_noise_sigma

plt.plot(x, y, "bo")
plt.xlabel("x")
plt.ylabel("y")

Out[14]:
<matplotlib.text.Text at 0x110dd7d68>`