## Multivariate Linear Regression¶

#### Learning Objectives¶

• Multivariate linear regression
• Vector implementation
• Input Scaling
• Concepts of machine learning: learning, cost function, gradient descent
• Basis functions

### Problem formulation¶

#### Supervised learning: regression¶

• $m$ observations (training examples) with
• $n$ Features:
• $i$-th training example
• ${\bf x}^{(i)}= \{ x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}\}$
• $x_j^{(i)}$: value of the feature $j$ for the $i$-th example
• $1 \leq j \leq n$ (univariate case $n=1$):
• $i$-th target value $y^{(i)}$

Goal: prediction of $y$ for a new $\bf x$.

#### Boston feature names¶

In [7]:
#load data
from sklearn.datasets import load_boston

# feature names
#boston.feature_names
# features (vec x)
#print(boston.data)
# house price (y)
#print(boston.target)


#### Linear model¶

• linear with respect to the parameters $\theta_j$
• (for now) linear with repect to $x_j$
$$h_\Theta ({x_1, \dots, x_n}) = \theta_0 + \theta_1 \cdot x_1 + \dots \theta_n \cdot x_n = \theta_0 + \sum_{j=1}^n \theta\ _j x_j$$

#### Example: Visualization¶

Data ($n=2$) and optimal hypothesis (Ebene).

In [6]:
print (boston.DESCR)

# feature names
print(boston.feature_names)

# features (vec x)
print(boston.data)

# house price (y)
print(boston.target)

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000 - PTRATIO pupil-teacher ratio by town - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town - LSTAT % lower status of the population - MEDV Median value of owner-occupied homes in$1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

**References**

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
3.96900000e+02   4.98000000e+00]
[  2.73100000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
3.96900000e+02   9.14000000e+00]
[  2.72900000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
3.92830000e+02   4.03000000e+00]
...,
[  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
3.96900000e+02   5.64000000e+00]
[  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
3.93450000e+02   6.48000000e+00]
[  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
3.96900000e+02   7.88000000e+00]]
[ 24.   21.6  34.7  33.4  36.2  28.7  22.9  27.1  16.5  18.9  15.   18.9
21.7  20.4  18.2  19.9  23.1  17.5  20.2  18.2  13.6  19.6  15.2  14.5
15.6  13.9  16.6  14.8  18.4  21.   12.7  14.5  13.2  13.1  13.5  18.9
20.   21.   24.7  30.8  34.9  26.6  25.3  24.7  21.2  19.3  20.   16.6
14.4  19.4  19.7  20.5  25.   23.4  18.9  35.4  24.7  31.6  23.3  19.6
18.7  16.   22.2  25.   33.   23.5  19.4  22.   17.4  20.9  24.2  21.7
22.8  23.4  24.1  21.4  20.   20.8  21.2  20.3  28.   23.9  24.8  22.9
23.9  26.6  22.5  22.2  23.6  28.7  22.6  22.   22.9  25.   20.6  28.4
21.4  38.7  43.8  33.2  27.5  26.5  18.6  19.3  20.1  19.5  19.5  20.4
19.8  19.4  21.7  22.8  18.8  18.7  18.5  18.3  21.2  19.2  20.4  19.3
22.   20.3  20.5  17.3  18.8  21.4  15.7  16.2  18.   14.3  19.2  19.6
23.   18.4  15.6  18.1  17.4  17.1  13.3  17.8  14.   14.4  13.4  15.6
11.8  13.8  15.6  14.6  17.8  15.4  21.5  19.6  15.3  19.4  17.   15.6
13.1  41.3  24.3  23.3  27.   50.   50.   50.   22.7  25.   50.   23.8
23.8  22.3  17.4  19.1  23.1  23.6  22.6  29.4  23.2  24.6  29.9  37.2
39.8  36.2  37.9  32.5  26.4  29.6  50.   32.   29.8  34.9  37.   30.5
36.4  31.1  29.1  50.   33.3  30.3  34.6  34.9  32.9  24.1  42.3  48.5
50.   22.6  24.4  22.5  24.4  20.   21.7  19.3  22.4  28.1  23.7  25.
23.3  28.7  21.5  23.   26.7  21.7  27.5  30.1  44.8  50.   37.6  31.6
46.7  31.5  24.3  31.7  41.7  48.3  29.   24.   25.1  31.5  23.7  23.3
22.   20.1  22.2  23.7  17.6  18.5  24.3  20.5  24.5  26.2  24.4  24.8
29.6  42.8  21.9  20.9  44.   50.   36.   30.1  33.8  43.1  48.8  31.
36.5  22.8  30.7  50.   43.5  20.7  21.1  25.2  24.4  35.2  32.4  32.
33.2  33.1  29.1  35.1  45.4  35.4  46.   50.   32.2  22.   20.1  23.2
22.3  24.8  28.5  37.3  27.9  23.9  21.7  28.6  27.1  20.3  22.5  29.
24.8  22.   26.4  33.1  36.1  28.4  33.4  28.2  22.8  20.3  16.1  22.1
19.4  21.6  23.8  16.2  17.8  19.8  23.1  21.   23.8  23.1  20.4  18.5
25.   24.6  23.   22.2  19.3  22.6  19.8  17.1  19.4  22.2  20.7  21.1
19.5  18.5  20.6  19.   18.7  32.7  16.5  23.9  31.2  17.5  17.2  23.1
24.5  26.6  22.9  24.1  18.6  30.1  18.2  20.6  17.8  21.7  22.7  22.6
25.   19.9  20.8  16.8  21.9  27.5  21.9  23.1  50.   50.   50.   50.
50.   13.8  13.8  15.   13.9  13.3  13.1  10.2  10.4  10.9  11.3  12.3
8.8   7.2  10.5   7.4  10.2  11.5  15.1  23.2   9.7  13.8  12.7  13.1
12.5   8.5   5.    6.3   5.6   7.2  12.1   8.3   8.5   5.   11.9  27.9
17.2  27.5  15.   17.2  17.9  16.3   7.    7.2   7.5  10.4   8.8   8.4
16.7  14.2  20.8  13.4  11.7   8.3  10.2  10.9  11.    9.5  14.5  14.1
16.1  14.3  11.7  13.4   9.6   8.7   8.4  12.8  10.5  17.1  18.4  15.4
10.8  11.8  14.9  12.6  14.1  13.   13.4  15.2  16.1  17.8  14.9  14.1
12.7  13.5  14.9  20.   16.4  17.7  19.5  20.2  21.4  19.9  19.   19.1
19.1  20.1  19.9  19.6  23.2  29.8  13.8  13.3  16.7  12.   14.6  21.4
23.   23.7  25.   21.8  20.6  21.2  19.1  20.6  15.2   7.    8.1  13.6
20.1  21.8  24.5  23.1  19.7  18.3  21.2  17.5  16.8  22.4  20.6  23.9
22.   11.9]


#### Vector formulation¶

Hypothesis: $h_{\theta}(\vec{x}) = \vec{\theta}^T \cdot \vec{x} = \theta_0 x_0 + \theta_1 x_1 + \dots \theta_n x_n$

• with $x_0 = 1$
• $n+1$ parameter: ${\vec{\theta}}^T = ( \theta_0, \theta_1, \dots , \theta_n )$

for convenience we use $\theta$ as symbol for all $\{ \theta_0, \theta_1, \dots \}$

### Cost function and gradient descent¶

#### Minimization of the cost function $J$¶

$$J({\theta}) = \frac{1}{2 m} \sum_{i=1}^m (h_{\theta}(\vec{x}^{(i)}) - y^{(i)})^2$$

#### Recap: Gradient Descent¶

Repeat until convergence:

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

#### Vector form of the update rule¶

with the definition of the gradient $$grad(J(\theta)) = \vec \nabla J(\theta) = \left( \begin{array}{c} \frac{\partial J(\theta)}{\partial \theta_0} \\ \frac{\partial J(\theta)}{\partial \theta_1} \\ \dots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{array}\right)$$

$$\vec \theta^{neu} \leftarrow \vec \theta^{alt} - \alpha \cdot grad(J(\theta^{alt}))$$

#### Partial derivates for linear regression¶

\begin{align*} \frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^m (h_\theta(\vec x^{(i)}) - y^{(i)})^2 \\ & = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^m (\vec \theta^T \cdot \vec x^{(i)} - y^{(i)}) ^2 \end{align*}

So we have update rules for all $\theta_j$ ($0 \leq j \leq n$):

$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (\vec \theta^T \cdot \vec x^{(i)} - y^{(i)}) x_j^{(i)\ }$$

#### Vector form of the update rule for linear regression¶

$$\vec \theta^{new} \leftarrow \vec \theta^{old} - \alpha \frac{1}{m} \sum_{i=1}^m (\vec \theta^T \cdot \vec x^{(i)} - \ y^{(i)}) \vec{x}^{(i)}$$

#### Feature Scaling¶

If the features have different order of magnitude: the learning will be very slow (see e.g. here on pages 26-30).

A remedy is rescaling of the features.

Idea: values of all {\it Features $x_j$) should be in the range: $$-1 \lesssim x_j \lesssim 1$$

#### Z-transformation¶

$$x_j' = \frac{x_j-\mu_j}{std(x_j)}$$

with

• mean of $x_j$: $\mu_j = \overline{x_j}= \sum_i x_j^{(i)} / m$
• standard deviation of $x_j$: $std(x_j) = \sqrt{var(x_j)}$
• variance of $x_j$: $var(x_j) = \overline{(x_j - \mu_j)^2} = \overline{x_j^2} - \mu_j^2$

For all features $x_j'$ the mean is 0 and the standard deviation is 1.

### Implementation and practice¶

#### Data matrix (design matrix)¶

Train data as matrix ${\bf X}$ with $X_{ij} \equiv x_j^{(i)}$

• each rows $i$ corresponds to a training example $\vec x^{(i)}$
• each column $j$ corresponds to the $j$-th feature
• with $x_0^{(i)} = 1$

Prediction can be done for all data by:

$$\vec{h}({\bf X}) = {\bf X} \cdot \vec{\theta}$$
In [23]:
import numpy as np

# dummy values
X = np.random.randn(700,4)
theta = np.array([2.,3.,4.,5.])

Out[23]:
(700, 4)
In [24]:
# e.g.: with 3 features and 700 training data with numpy:
print (X.shape)
print (theta.shape)
h = X.dot(theta)
print (h.shape)

(700, 4)
(4,)
(700,)


#### Vectorisierte Form of the update rule with Data Matrix $\bf X$¶

From the vectorised form of the update rule

$$\vec \Theta^{new} \leftarrow \vec \Theta^{old} - \alpha \frac{1}{m} \sum_{i=1}^m (h(\vec x^{(i)}) - y^{(i)}) \vec{x}^\ {(i)}$$

we get with the data matrix ${\bf X}$ $$\vec \Theta^{new} \leftarrow \vec \Theta^{old} - \alpha \frac{1}{m} {\bf X^T} \cdot ( \vec{h}({\bf X}) - \vec y)$$

in numpy:

 theta = theta - alpha * (1.0 / m) * X.T.dot(h - y)

#### For testing if the gradient descent implementation works¶

Remember the goal is to find the $\text{min}_\theta J(\theta)$

Plot $J(\theta)$ over the iterations: $J(\theta)$ must become smaller in each iteration (for full batch learning).

#### Learning Rate $\alpha$¶

How to choose $\alpha$?

Look at the progress during learning for different values of $\alpha$!

to small values for $\alpha$:

to large values for $\alpha$:

If $\alpha$ is to small $\Rightarrow$ smale convergence.

If $\alpha$ is to large $\Rightarrow$ $J$ grows or (oscillation or suboptimal progress). Wenn α zu groÿ ist, eventuell keine Konvergenz:

Try different $\alpha$'s (on log-space) or e.g.: 0.003, 0.006, 0.009, 0.03, 0.06, 0.09, $\dots$

Note: With feature scaling learning yield modified parameters : $\theta_0'$ und $\theta_1'$

\begin{align*} h(x) & = {\theta_0}' +{\theta_1}' x' \\ & = {\theta_0}' +{\theta_1}' \cdot \frac{x - \mu}{\sigma_x} \\ & = ({\theta_0}' - {\theta_1}' \cdot \frac{\mu}{\sigma_x} )+ \frac{{\theta_1}'}{\sigma_x} \cdot x \end{align*}

i.e. \begin{align} \theta_0 & = \Theta_0' - \theta_1' \cdot \frac{\mu}{\sigma_x} \ \theta_1 & = \frac{\theta_1'}{\sigma_x} \end{align}

### Basis functions¶

So far: The model is linear in respect to the input:

$$h_\theta ({\vec x}) = \theta_0 + \sum_{j=1}^n \theta_j x_j$$

Extension: Replace the $x_j$ with basis functions $\phi_k ( {\vec x})$:

$$h_\theta (\vec x) = \theta_0 + \sum_{k=1}^{n'} \theta_k \phi_k ( {\vec x})$$

Still a linear model. It's linear with respect to the parameters $\theta_j$

#### Examples for basis functions¶

Polynoms: $$\phi_k ( {\vec x}) = x_1^2$$

$$\phi_{k'} ( {\vec x}) = x_1 \cdot x_3$$

Gaussian Basis Functions''

$$\phi_{k''} ( {\vec x}) = \exp\{ - \frac{(x-\mu_j)^2}{2 \sigma_j^2}\}$$

#### (New) Features¶

• Transformation of the raw data $\vec{x_i}$ to features $\vec \phi(\vec x_i)$

Example: prediction of the price of land:

• raw data: length $x_1$ and width $x_2$. \

• instead: $h_\theta = \theta_0 + \theta_1 \cdot length + \theta_2 \cdot width$ \ \vspace{1cm}

• use the area'' as feature $\phi_1$: $area = length \cdot width$

$$h'_\theta = \theta_0' + \theta_1' \cdot area$$

#### Exercise: Vector implementation of gradient descent¶

Note:

• The implementation should work with an arbitrary number of features.
1. Generate artificial data for the design matrix $\bf X$ with two features $x_1, x_2$ (or three with fixed $x_0=1$)
2. Implement a get_linear_hypothesis function:
>theta = np.array([1.1, 2.0, -.9])
> h = get_linear_hypothesis(theta)
> print (h(X))
array([ -0.99896965,  20.71147926, ...
3. Target values:
1. Use the get_linear_hypothesis function for generating y-values (add gaussian noise)
2. Plot the x1-x2-y data points in a 3D-plot, see http://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html
4. Implement the 'get_cost_function(x, y)' python function:

> j = get_cost_function(X, y)
> print (j(theta))
> 401.20  # depends on X and y
5. Implement gradient descent.

1. Implement a function for the update rule:

> theta = compute_new_theta(x, y, theta, alpha)

2. Implement a function gradient_descent(alpha, theta, X, y). theta are the start values for $\vec \theta$. The function applies iteratively compute_new_theta. Use an arbitrary stoping criterion.

6. Plot the progress (cost value over iterations) für 5B

7. Plot the optimal hypothesis in a graph with the data (see 3B).