Univariate Linear Regression

Learning Objectives

  • (univariate) linear regression
  • concepts of machine learning
    • model
    • learning with a training set
    • cost function
    • gradient descent

Problem Formulation

supervised learning (linear regression)

$m$-observations: {$ x^{(i)}$} with target values {$y^{(i)} \in \mathbb R$}

  • Goal: prediction of $y$ for a new $x$.

  • Linear Model

    • Linear equation (equation of a straight line): $$ h_\Theta (x) = \theta_0 + \theta_1 x $$

Note: There are two parameters: $\theta_0,\theta_1$ (parametric model)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
###blood haemoglobin (Hb) levels and packed cell volumes (PCV) of 14 female blood bank donors
blood = np.array([

x = blood[:,0]
y = blood[:,1] 

from numpy import arange, array, ones, linalg
A = array([ x, ones(len(x))])

theta = linalg.lstsq(A.T,y)[0] # obtaining the parameters

line = theta[0]*x+theta[1] # regression line


In [3]:
plt.scatter(x,y, label="Experimental Values")
plt.xlabel("blood haemoglobin(Hb) levels / g/dL")
plt.ylabel("packed cell volumes (PCV) / ?")


Find a line $h_{\Theta}(x)$ that goes "as near as possible" through the data.

So we are looking for values of the parameters $\Theta = \{\theta_0, \theta_1\}$

A quantification for "as near as possible" is given later. Any ideas?

In [4]:
plt.scatter(x,y, label="Experimental Values")
plt.plot(x,line,'r-', label='Fit with $\Theta_0 = $' + str(round(theta[1], 3)) + 
         "  $\Theta_1 = $" + str(round(theta[0],4)) ) 
plt.xlabel("blood haemoglobin(Hb) levels / g/dL")
plt.ylabel("packed cell volumes (PCV)")

Training set

  • $m$: number of training examples
  • $x$: input variable
  • $y$: output variable
  • $(x^{(i)}, y^{(i)})$: $i$-th training example
$$ \mathcal D_{train} = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\} $$

Example data set: Hg-PCV

haemoglobin level ($x$) packed cell volume ($y$)
15.5 0.450
13.6 0.420
13.5 0.440
13.0 0.395
$\dots$ $\dots$

Overview: Training Procedure

  • Model: $h_\Theta (x) = \theta_0 + \theta_1 x$
    • $h_\Theta(x)$ : hypothesis
    • two parameters
  • inference: estimating the model parameters $\bf \theta$ by learning from data
In [5]:
from IPython.display import Image

Lineare regression with one variable (univariate linear regression)

Why the name ``Linear regression with one variable''?

  • One variable: $x$
  • Hypothesis - Model $h_\theta (x) = \theta_0 + \theta_1 x $
  • Hypothesis is linear with respect to the parameters $\theta_0, \theta_1$.
  • Prediction of a floating point number: Regression

  • (Hypothesis is linear with respect to the variable $x$ - not the reason for linear regression)


Linear regression can be solved algebraically (with a pseudo inverse). This is not considered here.

Cost function

Remember we are looking for a strait line that goes "as near as possible" through the data.

We give each possible strait line a value which quantifies the quality.

  • The lower the value the better the "fit".
  • That's the cost function (or error function).

(squared error) cost function :

$$ J_{\mathcal D}(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2 $$

Typical cost function for regression (can be derived from the "Maximum Likelihood Principle".)

Goal: Minimization of the cost

$$ \text{minimize}_{\theta} J(\theta) $$


  • Hypothesis $h_{\Theta}(x)$ is a function of $x$ with fixed parameters $\Theta$.
  • Cost function $J_{\mathcal D}(\Theta)$ is a function of $\Theta$ (for a fixed training data set $\mathcal D$)

We can reflect this fact in the implementation:

In [7]:
def get_linear_hypothesis(theta_0, theta_1):
    return lambda x: theta_0 + theta_1 * x

def get_squared_error_cost_function(x, y, get_hypothesis):
    m = len(x)
    return lambda theta_0, theta_1: 1. / (2. * m) * ((get_hypothesis(theta_0, theta_1)(x) - y)**2).sum()


no intersect ($\theta_0$), i.e. hypothesis is $ h_{\theta_1}(x) = \theta_1 \cdot x $

training data: $\mathcal D = \{ (1, .5), (2, 3), (3, 4) \}$

In [8]:
x = np.array([0.,1., 3.])
y = np.array([0, 2., 4.])

costs = np.zeros([len(np.arange(0.,3.,0.1))])
thetas_1 = np.arange(0.,3.,0.1)
cost_function = get_squared_error_cost_function(x, y, get_linear_hypothesis)
for i, theta_1 in enumerate(thetas_1):
    costs[i] = cost_function(0., theta_1)
In [9]:
f, axarr = plt.subplots(1, 2, figsize=(8,4))
axarr[0].scatter(x,y); axarr[0].set_title('Data'); axarr[0].set_xlabel("x"); axarr[0].set_ylabel("y")
axarr[1].plot(thetas_1, costs); axarr[1].set_title('Cost');axarr[1].set_xlabel("$\Theta_1$"); axarr[1].set_ylabel("cost")
<matplotlib.text.Text at 0x110a9c748>
In [10]:
x = y = np.arange(-1.0, 1.0, 0.1)
y = 3. + 2. * x + np.random.randn(len(y)) * 0.2

thetas = [[2., 1.5, 'c'], [2.5,3.,'k'], [3., 2., 'g']]

#computation of the cost-grid                                                                                              
w0s  = np.arange(1.5,4.5,0.02)
w1s = np.arange(.5,3.5,0.02)
cost = np.zeros([len(w0s),len(w1s)])
cost_function = get_squared_error_cost_function(x, y, get_linear_hypothesis)
for i, theta_0 in enumerate(w0s):
    for j, theta_1 in enumerate(w1s):
        cost[i][j] = cost_function(theta_0, theta_1)

X, Y = np.meshgrid(w0s, w1s)
In [11]:
plt.subplot(121);plt.plot(x,y,'+r');plt.xlabel('x');plt.ylabel('y');plt.title('Data and Hypotheses')
for t in thetas:
    h = get_linear_hypothesis(t[0], t[1])(x)
plt.subplot(122);plt.contour(X, Y, cost);plt.xlabel('$\Theta_0$');plt.ylabel('$\Theta_1$');plt.title('Contour Plot')
for t in thetas:
    plt.plot(t[0],t[1],'o'+ t[2])
In [12]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.cm as cm
vmin = cost.min()
vmax = cost.max()
In [13]:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(122, projection='3d')
ax.plot_surface(X, Y, cost, cmap=cm.jet, rstride=5, cstride=5, vmax=vmax, vmin = vmin, antialiased=True )
<matplotlib.text.Text at 0x110eb6128>


The cost function for linear regression is convex, i.e. only one (global) minimum.

Cost function summary

  • Cost is a function of the parameters.
  • The goal is minimization of the cost for finding "good" parameters.
  • Concept of cost function also for other kinds of models, e.g. neuronal networks or k-means clustering.

Gradient descent

Recap: Gradient

The gradient of a scalar function (e.g. cost function) is:

$$ \begin{align*} \vec \nabla J(\theta_0, \theta_1) = \begin{pmatrix} \frac{\partial }{\partial \theta_0} \\ \frac{\partial }{\partial \theta_1} \end{pmatrix} J(\theta_0, \theta_1) = \begin{pmatrix} \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} \\ \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} \end{pmatrix} \end{align*} $$

mit dem Nabla-Operator $$ \begin{align*} \vec \nabla = \begin{pmatrix} \frac{\partial }{\partial \theta_0} \\ \frac{\partial }{\partial \theta_1} \end{pmatrix} \end{align*} $$

Recap: Problem formulation

  • Hypothesis: $h_{\Theta}(x) = \theta_0 +\theta_1 \cdot x$
  • Parameters: $\Theta = \{\theta_0 , \theta_1\}$
  • Cost function: $J(\Theta) = \frac{1}{2m} \sum_{i=1}^m (h_\Theta(x^{(i)})-y^{(i)})^2$
  • Goal: $\text{minimize}_{\Theta} J(\Theta)$

Gradient Descent Idea

Goal: $\text{minimize}_{\Theta} J(\Theta)$

  1. Start with special values for $\Theta$ (start values); for the univariate linear regression: $\Theta = \{\theta_0, \theta_1\}$
  2. In each iteration:
    • Change the values of $\Theta$, such that $J(\Theta)$ becomes smaller.
    • Use the gradient for finding new values in each iteration.

Gradient descent procedure

  1. Start with special values for $\Theta$ (start values); for the univariate linear regression: $\Theta = \{\theta_0, \theta_1\}$
  2. Repeat (until a sufficent good minimum was found - stopping criterion):
    • Change the values of $\Theta$ by using the partial derivatives (gradient) with the following update rule: $$ \theta_j^{neu} \leftarrow \theta_j^{alt} - \alpha \frac{\partial}{\partial \theta_j} J(\Theta^{alt}) $$ with the hyper parameter $\alpha>0$ (learning rate)

Note for the implementation: simultaneous update of all parameters

\begin{align*} temp0 &\leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \\ temp1 &\leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \\ \theta_0 &\leftarrow temp0 \\ \theta_1 &\leftarrow temp1 \\ \end{align*}

Partial derivative for $\theta_0$

$$ \theta_0 \leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\Theta) $$\begin{align*} \frac{\partial}{\partial \theta_0} J(\Theta) &= \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m (h_\Theta(x^{(i)})-y^{(i)})^2 \\ & = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) ^2 \\ & = \frac{1}{m} \sum_{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) \end{align*}

Partial derivative for $\theta_1$

$$ \theta_1 \leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\Theta) $$

\begin{align} \frac{\partial}{\partial \theta1} J(\Theta) &= \frac{\partial}{\partial \theta_1} \frac{1}{2m} \sum{i=1}^m (h\Theta(x^{(i)})-y^{(i)})^2 \ & = \frac{\partial}{\partial \theta_1} \frac{1}{2m} \sum{i=1}^m (\theta0 + \theta_1 \cdot x^{(i)} - y^{(i)}) ^2 \ & = \frac{1}{m} \sum{i=1}^m (\theta_0 + \theta_1 \cdot x^{(i)} - y^{(i)}) \cdot x^{(i)} \end{align}

Step size

step size depends on two factors:

  • size of the gradient $\vec \nabla J(\Theta)$
  • learning rate $\alpha > 0$ (hyper parameter)

$\alpha$ must be choosen carefully.

Exercise: Implementation of gradient descent for the linear regression

with python and numpy

Use the cost function and the linear hypothesis (see above).

  1. Write a python function for the update of the new $\Theta$ values:
    theta_0, theta_1 = compute_new_theta(x, y, theta_0, theta_1, alpha)
  2. Write a python function that given as input start values for $\theta_0$ and $\theta_1$ iteratively applies the the `compute_new_theta' function to find "good" $\Theta$ values (in a python function).

    • Use the synthetic train data (see below).
    • Plot the progress (cost over iterations) of the training procedure.
  3. Plot the best hypothesis (strait line) together with the data.

  4. Try different values for the hyper parameter $\alpha$ and plot for all the progress (cost value over iterations) in a graph.

In [14]:
# generation of synthetic train data
x_min = -10.
x_max = 10.
m = 10

x = np.random.uniform(x_min, x_max, m)
a = 10.
b = 5.
y_noise_sigma = 4.
y = a + b * x + np.random.randn(m) * y_noise_sigma

plt.plot(x, y, "bo")
<matplotlib.text.Text at 0x110dd7d68>