Multivariate Linear Regression¶

Learning Objectives¶

Multivariate linear regression
Vector implementation
Input Scaling
Concepts of machine learning: learning, cost function, gradient descent
Basis functions

Problem formulation¶

Supervised learning: regression¶

$m$ observations (training examples) with
$n$ Features:
- $i$-th training example
  - ${\bf x}^{(i)}= \{ x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}\}$
  - $x_j^{(i)}$: value of the feature $j$ for the $i$-th example
  - $1 \leq j \leq n$ (univariate case $n=1$):
- $i$-th target value $y^{(i)}$

Goal: prediction of $y$ for a new $\bf x$.

Example Boston data¶

Boston feature names¶

In [7]:

#load data
from sklearn.datasets import load_boston
boston = load_boston()

# feature names
#boston.feature_names    
# features (vec x)
#print(boston.data)
# house price (y)
#print(boston.target)

Linear model¶

linear with respect to the parameters $\theta_j$
(for now) linear with repect to $x_j$

$$ h_\Theta ({x_1, \dots, x_n}) = \theta_0 + \theta_1 \cdot x_1 + \dots \theta_n \cdot x_n = \theta_0 + \sum_{j=1}^n \theta\ _j x_j $$

Example: Visualization¶

Data ($n=2$) and optimal hypothesis (Ebene).

In [6]:

print (boston.DESCR)

# feature names
print(boston.feature_names)
    
# features (vec x)
print(boston.data)

# house price (y)
print(boston.target)

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
    3.96900000e+02   4.98000000e+00]
 [  2.73100000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
    3.96900000e+02   9.14000000e+00]
 [  2.72900000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
    3.92830000e+02   4.03000000e+00]
 ..., 
 [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   5.64000000e+00]
 [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.93450000e+02   6.48000000e+00]
 [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   7.88000000e+00]]
[ 24.   21.6  34.7  33.4  36.2  28.7  22.9  27.1  16.5  18.9  15.   18.9
  21.7  20.4  18.2  19.9  23.1  17.5  20.2  18.2  13.6  19.6  15.2  14.5
  15.6  13.9  16.6  14.8  18.4  21.   12.7  14.5  13.2  13.1  13.5  18.9
  20.   21.   24.7  30.8  34.9  26.6  25.3  24.7  21.2  19.3  20.   16.6
  14.4  19.4  19.7  20.5  25.   23.4  18.9  35.4  24.7  31.6  23.3  19.6
  18.7  16.   22.2  25.   33.   23.5  19.4  22.   17.4  20.9  24.2  21.7
  22.8  23.4  24.1  21.4  20.   20.8  21.2  20.3  28.   23.9  24.8  22.9
  23.9  26.6  22.5  22.2  23.6  28.7  22.6  22.   22.9  25.   20.6  28.4
  21.4  38.7  43.8  33.2  27.5  26.5  18.6  19.3  20.1  19.5  19.5  20.4
  19.8  19.4  21.7  22.8  18.8  18.7  18.5  18.3  21.2  19.2  20.4  19.3
  22.   20.3  20.5  17.3  18.8  21.4  15.7  16.2  18.   14.3  19.2  19.6
  23.   18.4  15.6  18.1  17.4  17.1  13.3  17.8  14.   14.4  13.4  15.6
  11.8  13.8  15.6  14.6  17.8  15.4  21.5  19.6  15.3  19.4  17.   15.6
  13.1  41.3  24.3  23.3  27.   50.   50.   50.   22.7  25.   50.   23.8
  23.8  22.3  17.4  19.1  23.1  23.6  22.6  29.4  23.2  24.6  29.9  37.2
  39.8  36.2  37.9  32.5  26.4  29.6  50.   32.   29.8  34.9  37.   30.5
  36.4  31.1  29.1  50.   33.3  30.3  34.6  34.9  32.9  24.1  42.3  48.5
  50.   22.6  24.4  22.5  24.4  20.   21.7  19.3  22.4  28.1  23.7  25.
  23.3  28.7  21.5  23.   26.7  21.7  27.5  30.1  44.8  50.   37.6  31.6
  46.7  31.5  24.3  31.7  41.7  48.3  29.   24.   25.1  31.5  23.7  23.3
  22.   20.1  22.2  23.7  17.6  18.5  24.3  20.5  24.5  26.2  24.4  24.8
  29.6  42.8  21.9  20.9  44.   50.   36.   30.1  33.8  43.1  48.8  31.
  36.5  22.8  30.7  50.   43.5  20.7  21.1  25.2  24.4  35.2  32.4  32.
  33.2  33.1  29.1  35.1  45.4  35.4  46.   50.   32.2  22.   20.1  23.2
  22.3  24.8  28.5  37.3  27.9  23.9  21.7  28.6  27.1  20.3  22.5  29.
  24.8  22.   26.4  33.1  36.1  28.4  33.4  28.2  22.8  20.3  16.1  22.1
  19.4  21.6  23.8  16.2  17.8  19.8  23.1  21.   23.8  23.1  20.4  18.5
  25.   24.6  23.   22.2  19.3  22.6  19.8  17.1  19.4  22.2  20.7  21.1
  19.5  18.5  20.6  19.   18.7  32.7  16.5  23.9  31.2  17.5  17.2  23.1
  24.5  26.6  22.9  24.1  18.6  30.1  18.2  20.6  17.8  21.7  22.7  22.6
  25.   19.9  20.8  16.8  21.9  27.5  21.9  23.1  50.   50.   50.   50.
  50.   13.8  13.8  15.   13.9  13.3  13.1  10.2  10.4  10.9  11.3  12.3
   8.8   7.2  10.5   7.4  10.2  11.5  15.1  23.2   9.7  13.8  12.7  13.1
  12.5   8.5   5.    6.3   5.6   7.2  12.1   8.3   8.5   5.   11.9  27.9
  17.2  27.5  15.   17.2  17.9  16.3   7.    7.2   7.5  10.4   8.8   8.4
  16.7  14.2  20.8  13.4  11.7   8.3  10.2  10.9  11.    9.5  14.5  14.1
  16.1  14.3  11.7  13.4   9.6   8.7   8.4  12.8  10.5  17.1  18.4  15.4
  10.8  11.8  14.9  12.6  14.1  13.   13.4  15.2  16.1  17.8  14.9  14.1
  12.7  13.5  14.9  20.   16.4  17.7  19.5  20.2  21.4  19.9  19.   19.1
  19.1  20.1  19.9  19.6  23.2  29.8  13.8  13.3  16.7  12.   14.6  21.4
  23.   23.7  25.   21.8  20.6  21.2  19.1  20.6  15.2   7.    8.1  13.6
  20.1  21.8  24.5  23.1  19.7  18.3  21.2  17.5  16.8  22.4  20.6  23.9
  22.   11.9]

Vector formulation¶

Hypothesis: $h_{\theta}(\vec{x}) = \vec{\theta}^T \cdot \vec{x} = \theta_0 x_0 + \theta_1 x_1 + \dots \theta_n x_n $

with $x_0 = 1$
$n+1$ parameter: ${\vec{\theta}}^T = ( \theta_0, \theta_1, \dots , \theta_n ) $

for convenience we use $\theta$ as symbol for all $\{ \theta_0, \theta_1, \dots \}$

Cost function and gradient descent¶

Minimization of the cost function $J$¶

$$ J({\theta}) = \frac{1}{2 m} \sum_{i=1}^m (h_{\theta}(\vec{x}^{(i)}) - y^{(i)})^2 $$

Recap: Gradient Descent¶

Repeat until convergence:

$$ \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$

Vector form of the update rule¶

with the definition of the gradient $$ grad(J(\theta)) = \vec \nabla J(\theta) = \left( \begin{array}{c} \frac{\partial J(\theta)}{\partial \theta_0} \\ \frac{\partial J(\theta)}{\partial \theta_1} \\ \dots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{array}\right) $$

$$ \vec \theta^{neu} \leftarrow \vec \theta^{alt} - \alpha \cdot grad(J(\theta^{alt})) $$

Partial derivates for linear regression¶

\begin{align*} \frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^m (h_\theta(\vec x^{(i)}) - y^{(i)})^2 \\ & = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^m (\vec \theta^T \cdot \vec x^{(i)} - y^{(i)}) ^2 \end{align*}

So we have update rules for all $\theta_j$ ($0 \leq j \leq n$):

$$ \theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (\vec \theta^T \cdot \vec x^{(i)} - y^{(i)}) x_j^{(i)\ } $$

Vector form of the update rule for linear regression¶

$$ \vec \theta^{new} \leftarrow \vec \theta^{old} - \alpha \frac{1}{m} \sum_{i=1}^m (\vec \theta^T \cdot \vec x^{(i)} - \ y^{(i)}) \vec{x}^{(i)} $$

Feature Scaling¶

If the features have different order of magnitude: the learning will be very slow (see e.g. here on pages 26-30).

A remedy is rescaling of the features.

Idea: values of all {\it Features $x_j$) should be in the range: $$ -1 \lesssim x_j \lesssim 1 $$

Z-transformation¶

$$ x_j' = \frac{x_j-\mu_j}{std(x_j)} $$

with

mean of $x_j$: $\mu_j = \overline{x_j}= \sum_i x_j^{(i)} / m $
standard deviation of $x_j$: $std(x_j) = \sqrt{var(x_j)}$
variance of $x_j$: $var(x_j) = \overline{(x_j - \mu_j)^2} = \overline{x_j^2} - \mu_j^2$

For all features $x_j'$ the mean is 0 and the standard deviation is 1.

Implementation and practice¶

Data matrix (design matrix)¶

Train data as matrix ${\bf X}$ with $X_{ij} \equiv x_j^{(i)}$

each rows $i$ corresponds to a training example $\vec x^{(i)}$
each column $j$ corresponds to the $j$-th feature
- with $x_0^{(i)} = 1$

Prediction can be done for all data by:

$$\vec{h}({\bf X}) = {\bf X} \cdot \vec{\theta}$$

In [23]:

import numpy as np

# dummy values
X = np.random.randn(700,4)
theta = np.array([2.,3.,4.,5.])

Out[23]:

(700, 4)

In [24]:

# e.g.: with 3 features and 700 training data with numpy:
print (X.shape)
print (theta.shape)
h = X.dot(theta)
print (h.shape)

(700, 4)
(4,)
(700,)

Vectorisierte Form of the update rule with Data Matrix $\bf X$¶

From the vectorised form of the update rule

$$ \vec \Theta^{new} \leftarrow \vec \Theta^{old} - \alpha \frac{1}{m} \sum_{i=1}^m (h(\vec x^{(i)}) - y^{(i)}) \vec{x}^\ {(i)} $$

we get with the data matrix ${\bf X}$ $$ \vec \Theta^{new} \leftarrow \vec \Theta^{old} - \alpha \frac{1}{m} {\bf X^T} \cdot ( \vec{h}({\bf X}) - \vec y) $$

in numpy:

 theta = theta - alpha * (1.0 / m) * X.T.dot(h - y)

For testing if the gradient descent implementation works¶

Remember the goal is to find the $\text{min}_\theta J(\theta)$

Plot $J(\theta)$ over the iterations: $J(\theta)$ must become smaller in each iteration (for full batch learning).

Learning Rate $\alpha$¶

How to choose $\alpha$?

Look at the progress during learning for different values of $\alpha$!

to small values for $\alpha$:

to large values for $\alpha$:

If $\alpha$ is to small $\Rightarrow$ smale convergence.

If $\alpha$ is to large $\Rightarrow$ $J$ grows or (oscillation or suboptimal progress). Wenn α zu groÿ ist, eventuell keine Konvergenz:

Try different $\alpha$'s (on log-space) or e.g.: 0.003, 0.006, 0.009, 0.03, 0.06, 0.09, $\dots$

Note: With feature scaling learning yield modified parameters : $\theta_0'$ und $\theta_1'$

\begin{align*} h(x) & = {\theta_0}' +{\theta_1}' x' \\ & = {\theta_0}' +{\theta_1}' \cdot \frac{x - \mu}{\sigma_x} \\ & = ({\theta_0}' - {\theta_1}' \cdot \frac{\mu}{\sigma_x} )+ \frac{{\theta_1}'}{\sigma_x} \cdot x \end{align*}

i.e. \begin{align} \theta_0 & = \Theta_0' - \theta_1' \cdot \frac{\mu}{\sigma_x} \ \theta_1 & = \frac{\theta_1'}{\sigma_x} \end{align}

Basis functions¶

So far: The model is linear in respect to the input:

$$ h_\theta ({\vec x}) = \theta_0 + \sum_{j=1}^n \theta_j x_j $$

Extension: Replace the $x_j$ with basis functions $\phi_k ( {\vec x}) $:

$$ h_\theta (\vec x) = \theta_0 + \sum_{k=1}^{n'} \theta_k \phi_k ( {\vec x}) $$

Still a linear model. It's linear with respect to the parameters $\theta_j$

Examples for basis functions¶

Polynoms: $$ \phi_k ( {\vec x}) = x_1^2 $$

$$ \phi_{k'} ( {\vec x}) = x_1 \cdot x_3 $$

``Gaussian Basis Functions''

$$ \phi_{k''} ( {\vec x}) = \exp\{ - \frac{(x-\mu_j)^2}{2 \sigma_j^2}\} $$

(New) Features¶

Transformation of the raw data $\vec{x_i}$ to features $\vec \phi(\vec x_i)$

Example: prediction of the price of land:

raw data: length $x_1$ and width $x_2$. \
instead: $h_\theta = \theta_0 + \theta_1 \cdot length + \theta_2 \cdot width $ \ \vspace{1cm}
use the ``area'' as feature $\phi_1$: $area = length \cdot width$

$$h'_\theta = \theta_0' + \theta_1' \cdot area $$

Polynominal regression¶

Exercise: Vector implementation of gradient descent¶

Note:

The implementation should work with an arbitrary number of features.

Generate artificial data for the design matrix $\bf X$ with two features $x_1, x_2$ (or three with fixed $x_0=1$)

Implement a get_linear_hypothesis function:

>theta = np.array([1.1, 2.0, -.9]) 
> h = get_linear_hypothesis(theta) 
> print (h(X)) 
array([ -0.99896965,  20.71147926, ...

Target values:
1. Use the get_linear_hypothesis function for generating y-values (add gaussian noise)
2. Plot the x1-x2-y data points in a 3D-plot, see http://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html

Implement the 'get_cost_function(x, y)' python function:

> j = get_cost_function(X, y) 
> print (j(theta))
> 401.20  # depends on X and y

Implement gradient descent.
1. Implement a function for the update rule:
  
  > theta = compute_new_theta(x, y, theta, alpha)
2. Implement a function gradient_descent(alpha, theta, X, y). theta are the start values for $\vec \theta$. The function applies iteratively compute_new_theta. Use an arbitrary stoping criterion.
Plot the progress (cost value over iterations) für 5B
Plot the optimal hypothesis in a graph with the data (see 3B).

Literature:¶

Andrew Ng: Machine Learning. Openclassroom Stanford University, 2013
C. Bishop: Pattern recognition and Machine Learning, Springer Verlag 2006