Goal: prediction of $y$ for a new $\bf x$.
#load data
from sklearn.datasets import load_boston
boston = load_boston()
# feature names
#boston.feature_names
# features (vec x)
#print(boston.data)
# house price (y)
#print(boston.target)
Data ($n=2$) and optimal hypothesis (a plane).
print (boston.DESCR)
# feature names
print(boston.feature_names)
# features (vec x)
print(boston.data)
# house price (y)
print(boston.target)
Hypothesis: $h_{\theta}(\vec{x}) = \vec{\theta}^T \cdot \vec{x} = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n $
For convenience we use $\theta$ as a symbol for the whole set $\{ \theta_0, \theta_1, \dots \}$.
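A tiny numpy sketch (illustrative values only) of evaluating this hypothesis for a single example, assuming the usual convention $x_0 = 1$ for the bias term:

import numpy as np
theta = np.array([1.0, 0.5, -2.0])  # example parameters theta_0, theta_1, theta_2
x = np.array([1.0, 3.0, 4.5])       # x_0 = 1 (bias), x_1, x_2
h = theta.dot(x)                    # h_theta(x) = theta^T . x
print(h)                            # 1.0 + 0.5*3.0 - 2.0*4.5 = -6.5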
Repeat until convergence:
$$ \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$
With the definition of the gradient
$$ \operatorname{grad}(J(\theta)) = \vec \nabla J(\theta) = \left( \begin{array}{c} \frac{\partial J(\theta)}{\partial \theta_0} \\ \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{array}\right) $$
the update can be written in vector form:
$$ \vec \theta^{new} \leftarrow \vec \theta^{old} - \alpha \cdot \operatorname{grad}(J(\theta^{old})) $$
So we have update rules for all $\theta_j$ ($0 \leq j \leq n$):
$$ \theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (\vec \theta^T \cdot \vec x^{(i)} - y^{(i)}) \, x_j^{(i)} $$
If the features have different orders of magnitude, learning will be very slow (see e.g. here on pages 26-30).
A remedy is to rescale the features.
Idea: the values of all features $x_j$ should be in the range: $$ -1 \lesssim x_j \lesssim 1 $$
This can be achieved with
$$ x_j' = \frac{x_j - \mu_j}{\sigma_j} $$
For all rescaled features $x_j'$ the mean is 0 and the standard deviation is 1.
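A minimal numpy sketch of this standardization (the helper name scale_features is an assumption; a constant column $x_0 = 1$, if present, would not be scaled):

import numpy as np

def scale_features(X):
    # column-wise standardization: x' = (x - mu) / sigma
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# usage (illustrative):
# X_scaled, mu, sigma = scale_features(X)
# X_scaled has (approximately) mean 0 and standard deviation 1 per column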
Training data as matrix ${\bf X}$ with $X_{ij} \equiv x_j^{(i)}$.
Prediction can be done for all data by:
$$\vec{h}({\bf X}) = {\bf X} \cdot \vec{\theta}$$
import numpy as np
# dummy values: e.g. 700 training examples and 3 features (plus the constant x_0) with numpy
X = np.random.randn(700, 4)
theta = np.array([2., 3., 4., 5.])
print (X.shape)
print (theta.shape)
h = X.dot(theta)
print (h.shape)
From the vectorised form of the update rule
$$ \vec \theta^{new} \leftarrow \vec \theta^{old} - \alpha \frac{1}{m} \sum_{i=1}^m (h(\vec x^{(i)}) - y^{(i)}) \, \vec{x}^{(i)} $$
we get with the data matrix ${\bf X}$
$$ \vec \theta^{new} \leftarrow \vec \theta^{old} - \alpha \frac{1}{m} {\bf X}^T \cdot ( \vec{h}({\bf X}) - \vec y) $$
in numpy:
theta = theta - alpha * (1.0 / m) * X.T.dot(h - y)
Remember: the goal is to find $\min_\theta J(\theta)$.
Plot $J(\theta)$ over the iterations: $J(\theta)$ must become smaller in each iteration (for full batch learning).
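A sketch of such a full-batch loop that records the cost per iteration and plots it (assuming the squared-error cost $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(\vec x^{(i)}) - y^{(i)})^2$, whose gradient gives the update rule above; train, num_iters etc. are illustrative names):

import numpy as np
import matplotlib.pyplot as plt

def train(X, y, theta, alpha, num_iters=500):
    m = len(y)
    cost_history = []
    for _ in range(num_iters):
        h = X.dot(theta)                                    # predictions for all examples
        theta = theta - alpha * (1.0 / m) * X.T.dot(h - y)  # one gradient descent step
        cost_history.append((1.0 / (2 * m)) * np.sum((X.dot(theta) - y) ** 2))
    return theta, cost_history

# theta, cost_history = train(X, y, theta_start, alpha=0.01)
# plt.plot(cost_history); plt.xlabel("iteration"); plt.ylabel(r"$J(\theta)$"); plt.show()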
How to choose $\alpha$?
Look at the progress during learning for different values of $\alpha$!
Too small values for $\alpha$:
Too large values for $\alpha$:
If $\alpha$ is too small $\Rightarrow$ slow convergence.
If $\alpha$ is too large $\Rightarrow$ $J$ grows, oscillates, or makes only suboptimal progress; possibly no convergence.
Try different $\alpha$'s (spaced on a log scale), e.g.: 0.003, 0.006, 0.009, 0.03, 0.06, 0.09, $\dots$
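A small sketch for comparing the learning progress for several $\alpha$ values (reusing the hypothetical train function sketched above; theta_start is an illustrative start vector):

import matplotlib.pyplot as plt

alphas = [0.003, 0.006, 0.009, 0.03, 0.06, 0.09]
for alpha in alphas:
    _, cost_history = train(X, y, theta_start.copy(), alpha, num_iters=200)
    plt.plot(cost_history, label=r"$\alpha = %g$" % alpha)
plt.xlabel("iteration")
plt.ylabel(r"$J(\theta)$")
plt.legend()
plt.show()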
Note: With feature scaling, learning yields modified parameters $\theta_0'$ and $\theta_1'$:
\begin{align*} h(x) & = {\theta_0}' +{\theta_1}' x' \\ & = {\theta_0}' +{\theta_1}' \cdot \frac{x - \mu}{\sigma_x} \\ & = ({\theta_0}' - {\theta_1}' \cdot \frac{\mu}{\sigma_x} )+ \frac{{\theta_1}'}{\sigma_x} \cdot x \end{align*}
i.e.
\begin{align*} \theta_0 & = {\theta_0}' - {\theta_1}' \cdot \frac{\mu}{\sigma_x} \\ \theta_1 & = \frac{{\theta_1}'}{\sigma_x} \end{align*}
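A minimal sketch of this back-transformation for the univariate case (theta0_scaled, theta1_scaled, mu and sigma are illustrative values):

# parameters learned on the scaled feature x' = (x - mu) / sigma (illustrative values)
theta0_scaled, theta1_scaled = 22.5, 6.3
mu, sigma = 6.28, 0.70

# back-transformation to the original (unscaled) feature x
theta0 = theta0_scaled - theta1_scaled * mu / sigma
theta1 = theta1_scaled / sigma
print(theta0, theta1)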
So far: the model is linear with respect to the input:
$$ h_\theta ({\vec x}) = \theta_0 + \sum_{j=1}^n \theta_j x_j $$
Extension: replace the $x_j$ with basis functions $\phi_k ( {\vec x}) $:
$$ h_\theta (\vec x) = \theta_0 + \sum_{k=1}^{n'} \theta_k \phi_k ( {\vec x}) $$
This is still a linear model: it is linear with respect to the parameters $\theta_k$.
Polynomials:
$$ \phi_k ( {\vec x}) = x_1^2 $$
$$ \phi_{k'} ( {\vec x}) = x_1 \cdot x_3 $$
``Gaussian basis functions'':
$$ \phi_{k''} ( {\vec x}) = \exp\left\{ - \frac{(x-\mu_{k''})^2}{2 \sigma_{k''}^2}\right\} $$
Example: prediction of the price of land:
Raw data: length $x_1$ and width $x_2$:
$$h_\theta = \theta_0 + \theta_1 \cdot \text{length} + \theta_2 \cdot \text{width}$$
Instead, use the ``area'' as feature $\phi_1$: $\text{area} = \text{length} \cdot \text{width}$
$$h'_\theta = \theta_0' + \theta_1' \cdot \text{area} $$
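A short numpy sketch of this feature transformation: build a design matrix with the area feature $\phi_1 = x_1 \cdot x_2$ plus a constant column $x_0 = 1$ (all values are illustrative):

import numpy as np

# raw data: columns are length (x_1) and width (x_2)
X_raw = np.array([[20., 30.],
                  [15., 40.],
                  [50., 10.]])

area = X_raw[:, 0] * X_raw[:, 1]                       # phi_1 = length * width
X_area = np.column_stack([np.ones(len(area)), area])   # columns: x_0 = 1, area
print(X_area)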
Implement the get_linear_hypothesis python function:
> theta = np.array([1.1, 2.0, -.9])
> h = get_linear_hypothesis(theta)
> print (h(X))
array([ -0.99896965, 20.71147926, ...
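One possible sketch of this function (it assumes the design matrix X already contains the constant column $x_0 = 1$; this is an assumption, not the official solution):

import numpy as np

def get_linear_hypothesis(theta):
    # returns a hypothesis function h with h(X) = X . theta
    # (X is assumed to already contain the constant column x_0 = 1)
    def h(X):
        return X.dot(theta)
    return h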
Use the get_linear_hypothesis function for generating y-values (add Gaussian noise). Then implement the 'get_cost_function(X, y)' python function:
> j = get_cost_function(X, y)
> print (j(theta))
401.20 # depends on X and y
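A possible sketch, assuming the squared-error cost $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(\vec x^{(i)}) - y^{(i)})^2$ and the get_linear_hypothesis sketch from above:

import numpy as np

def get_cost_function(X, y):
    m = len(y)
    def j(theta):
        h = get_linear_hypothesis(theta)
        return (1.0 / (2 * m)) * np.sum((h(X) - y) ** 2)
    return j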
Implement gradient descent.
Implement a function for the update rule:
> theta = compute_new_theta(X, y, theta, alpha)
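A minimal sketch of such an update function, matching the vectorized numpy one-liner above (a sketch, not the official solution):

def compute_new_theta(X, y, theta, alpha):
    m = len(y)
    h = X.dot(theta)                                   # predictions for all training examples
    return theta - alpha * (1.0 / m) * X.T.dot(h - y)  # one gradient descent step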
Implement a function gradient_descent(alpha, theta, X, y). theta are the start values for $\vec \theta$. The function iteratively applies compute_new_theta. Use an arbitrary stopping criterion.
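A possible sketch (the tolerance-based stopping criterion and the reuse of get_cost_function and compute_new_theta from above are assumptions):

def gradient_descent(alpha, theta, X, y, tol=1e-7, max_iters=10000):
    j = get_cost_function(X, y)
    cost_history = [j(theta)]
    for _ in range(max_iters):
        theta = compute_new_theta(X, y, theta, alpha)
        cost_history.append(j(theta))
        if abs(cost_history[-2] - cost_history[-1]) < tol:  # arbitrary stopping criterion
            break
    return theta, cost_history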
Plot the progress (cost value over iterations) for 5B.
Plot the optimal hypothesis in a graph with the data (see 3B).