### Model complexity and generalization¶

#### Learning Objectives¶

Concepts of machine learning

• Target function
• Overfitting, underfitting
• Generalization
• Out-of-sample error and test data
• Model complexity

### Target function and noise¶

For prediction: We want to approximate an unknown function, the target function $t(\vec x)$.

If the targets $y$ would be deterministically determined from the $\vec x$ values:

$$y = t(\vec {x})$$

But $y$ is typically noisy. We can model this by:

$$y^{(i)} = t(\vec {x}^{(i)}) + \epsilon$$

with a random variable $\epsilon$ (= stochastic noise)

Sidenote:

Instead of a function we now a a (conditional) probability distribution: $$p(y \mid \vec {x})$$

The probability distribution of the noise can depend on $\vec {x}$: $p(\epsilon \mid \vec {x})$.

Goal:

We are looking for "good" hypotheses. $$h(\vec x) \approx t(\vec x)$$ for all $\vec{x}$ of interest, i.e. for all possible $\vec{x}$ we want to predict. ($p(\vec x)$ suffient large).

Training data is noisy: $y^{(i)} = t(\vec {x}^{(i)}) + \epsilon$

Therefore $h(\vec{x}^{(i)})$ should not perfectly match $y^{(i)}$ of the training data.
But how good should be the correspondence?

#### Sampling Error¶

• Which pattern are caused by the target function?
• Which pattern are caused by the (random chosen) train data?

### Overfitting and Underfitting¶

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import numpy as np

def test_func(f, x, err=1.):
return np.random.normal(f(x), err)

def compute_error(x, y, p):
yfit = np.polyval(p, x)
return np.sqrt(np.mean((y - yfit) ** 2))

def plot_under_overfit_polynoms():
N = 10
np.random.seed(45)
x = np.linspace(-4, 3., N)
quadratic_func = lambda x: -2 + 3. * x + -2. * x**2

xfit = np.linspace(-4, 3., 1000)

degrees = [1, 2, 9]
titles = ['d = 1 (under-fit)', 'd = 2', 'd = '+ str(degrees[2]) + ' (over-fit)']

plt.figure(figsize = (9, 3.5))
bottom=0.15, top=0.85,
wspace=0.05)
yticks = [-50,-40,-30,-20,-10, 0, 10]
for i, d in enumerate(degrees):
plt.subplot(131 + i, yticks=yticks)
yticks=[]
plt.scatter(x, y, marker='x', c='k', s=50)

p = np.polyfit(x, y, d)
yfit = np.polyval(p, xfit)
plt.plot(xfit, yfit, '-b')

#pl.xlim(-0.2, 1.2)
plt.ylim(-55, 15)
plt.xlabel(u'x')
if i == 0:
plt.ylabel('y')
plt.title(titles[i])


#### Example: Linear Regression with polynominal basis functions¶

In [3]:
plot_under_overfit_polynoms()


#### Goal of learning - Hypothesis Set $\mathcal{H}$}¶

Goal of learning: We are looking for a appropriate hypothesis $$h(\vec x) \approx t(\vec x)$$

$h$ is determined from the hypothesis set $\mathcal{H}$ by training.

Example Univariate Linear Regression'':

$\mathcal H$ is the set of all (infinit) hypotheses: $h_\theta(x) = \theta_0 + \theta_1 x$ (by variation of $\theta_0$ and $\theta_1$).

From the set $\mathcal{H}$ we get through training a $\theta$, i.e. a hypothesis $h_{\theta_{final}}({x})$.

$h_{\theta_{final}}({\bf x})$ is given by the minimum of the cost function:

$\forall \theta: J(\theta_{final}) \leq J(\theta)$

#### Underfitting¶

Recap: Goal is $h(\vec x) \approx t(\vec x)$.

What happens in the underfitting case?

The goal can not be reached: No hypothesis of $\mathcal{H}$ is similar enough to $t(\vec x)$.

The hypothesis set $\mathcal{H}$ has not suffient complexity.

#### Overfitting¶

Recap: Goal is $h(\vec x) \approx t(\vec x)$.

What happens in the overfitting case?

The trained hypothesis $h_{\theta_{final}}({x})$ is adaped to much on the training data: also on the noise pattern.

The hypothesis set $\mathcal{H}$ has to much complexity.

### Test data¶

#### Training error¶

Up to now only a training set $\mathcal{D}_{train}$ and calculation of the average loss of the traing data (cost function):

Training error (in-sample error $E_{in}(h)$):

$$E_{in}(h) = \frac{1}{m} \sum_{i=0}^{m} loss(h(\vec x^{(i)}), y^{(i)})$$

Note: We consider $E_{in}(h)$ now as a function of possible hypotheses $h$. We want to compare different models.

The low training error is not suffient to judge a "good" model.

Overfitting!

$h(\vec x)$ can make bad predictions for $\vec x$ values not in the training set.

#### Out-of-sample error¶

The out-of-sample error is the average loss of typical (new) data ($p({\vec x}', y')$ suffient large):

$$E_{out}(h) = \mathbb{E}_{\bf x,y}\left[loss(h({\vec x}), y)\right] = \int\limits_{\mathcal{X \times Y}} loss(h({\vec x}), y) dp({\vec x},y)$$

#### Generalization error¶

The generalization error of a hypothesis $h$ can defined as (after [Abu]):

$$| E_{out}(h) - E_{in}(h)|$$

#### Test data¶

Problem: We can compute the out-of-sample error $E_{out}$ directly.

Take $m_{test}$ labeled test data $\mathcal{D}_{test}$ to check if $h(\vec x)$ predicts good values even for $\vec x$ values not used in training.

$$\mathcal{D}_{test}= \{ ( {\vec {x'}}^{(0)}, {y'}^{(0)} ), ( {\vec {x'}}^{(1)}, {y'}^{(1)} ), \dots, ( {\vec {x'}}^{(m_{test})}, {y'}^{(m_{test})} ) \}$$

#### Test error¶

The test error is the average loss of the test data:

$$testerror = \frac{1}{m'} \sum_{i=0}^{m_{test}} loss(h({\vec {x'}}^{(i)}), {y'}^{(i)})$$

The test error is an approximation of the out-of-sample error $E_{out}(h)$:
$$testerror \approx \mathbb{E}_{\mathcal{X,Y}}[loss(h({\bf x}), y)] = E_{out}(h)$$

#### Model complexity, Capacity¶

informal:

The complexity of $\mathcal{H}$ describes how many compex functions the hypothesis set $\mathcal{H}$ can cover.

Measure of model complexity are

• VC-Dimension $d_{vc}$ (Vapnik–Chervonenkis Dimension)

#### Expectation of out-of-sample error¶

Starting point for the Bias-Variance Analysis is the expectation of out-of-sample error:

$$\mathbb{E}_{\mathcal{D}} \left[ E_{out}(h^{\mathcal{D}}({\bf x})) \right]$$

with

• $h^{\mathcal{D}}$: trained hypothesis from $\mathcal{D}$ (training data)
• $E_{out}(h^{\mathcal{D}})$: out-of-sample error for the trained hypothesis
• $\mathbb{E}_{\mathcal{D}}$: Expectation with respect to the training data (a training set is an event).

#### Bias-variance analysis¶

without noise and squared error: \begin{align*} \mathbb{E}_{\mathcal{D}} \left[ E_{out}(h^{\mathcal{D}}) \right] & = \mathbb{E}_{\mathcal{D}} \left[ \mathbb{E}_{\bf x,y} \left[ \left(h^{\mathcal{D}}({\bf x}) - t({\bf x})\right)^2 \right] \right] \\ & = \mathbb{E}_{\bf x,y} \left[ \mathbb{E}_{\mathcal{D}} \left[ \left(h^{\mathcal{D}}({\bf x}) - t({\bf x})\right)^2 \right] \right] \end{align*}

with the average hypothesis $$\tilde{h}({\bf x}) = \mathbb{E}_{\mathcal{D}}\left[ h^{\mathcal{D}}({\bf x}) \right]$$

We get (see e.g. [Abu] p.63): \begin{align*} \mathbb{E}_{\mathcal{D}} \left[ E_{out}(h^{\mathcal{D}}({\bf x})) \right] &= \mathbb{E}_{\bf x,y} \left[ \mathbb{E}_{\mathcal{D}} \left[\left(h^{\mathcal{D}}({\bf x}) - \tilde{h}({\bf x}) \right)^2\right]\right] &+&\mathbb{E}_{\bf x,y} \left[ \left(\tilde{h}({\bf x}) -t({\bf x}) \right)^2 \right]& \\ &= variance &+&bias^2 \end{align*}

with noisy data there is a thrid term: the irreducable error $\mathbb{E}_{\mathcal X}[\epsilon^2]$ (Bayes Error, see e.g. [Has])}

#### Interpretation: Bias¶

$$\mathbb{E}_{\bf x,y} \left[ \left(\tilde{h}({\bf x}) -t({\bf x}) \right)^2 \right]$$ Quadratic deviation of the average hypothesis $\tilde h$ from the target function $t$ $\leftrightarrow$ Underfitting

#### Interpretation: Variance¶

$$\mathbb{E}_{\bf x,y} \left[ \mathbb{E}_{\mathcal{D}} \left[\left(h^{\mathcal{D}}({\bf x}) - \tilde{h}({\bf x}) \right)^2\right]\right]$$

Average quadratic deviation {$h^{\mathcal{D}}$} from the average hyposesis $\tilde h$ $\leftrightarrow$ Overfitting

#### Example¶

from [Abu]

• Target function $sin(x)$
• $p(x)$: Uniform distributed in the range $(0, 2 \pi)$
• no noise
• only two training examples. i.e. $n=2$
• Model 1: $h(x)=w$ (constant)
• Model 2: $h(x)=\theta_0 + \theta_1 x$

#### Exercise¶

Approximate: $\mathbb E (E_{out})$, the bias and the variance for both models by sampling. A sample correspond to a training set (with 2 examples)

In [5]:
x, y  = train_data()
theta_0, theta_1 = get_theta(x, y)
plt.plot(x_, y_, 'b-')
plt.scatter(x,y,color='r')
plt.plot(x_,  theta_0 + theta_1 * x_, 'g-')

Out[5]:
[<matplotlib.lines.Line2D at 0x10ef89358>]

No noise on $y$ : $$\mathbb E_{\bf x, \bf y} [ E_{out} ] = \int_{-\infty}^{\infty} (t(x)- h(x))^2 p(x) dx = \int_{0}^{2 \pi} (t(x)- h(x))^2 dx$$

In [8]:
expectation_Eout_1 = e_out_m1.mean()
print ("expectation of E_out of model 1:", expectation_Eout_1)

expectation of E_out of model 1: 0.7495649709872853

In [9]:
expectation_Eout_2 = e_out_m2.mean()
print ("expectation of E_out of model 2:", expectation_Eout_2)

expectation of E_out of model 2: 1.8852093027834775

In [10]:
# Average Model 1 and 2
plt.plot(x_, y_, 'b-', label='target function')
plt.plot(x_,  avg_theta_0 + avg_theta_1 * x_, 'r-', label='average model 2')
plt.plot(x_,  np.ones(len(x_)) * avg_w, 'g-', label='average model 1')
plt.xlabel("x"); plt.ylabel("y"), plt.legend()

Out[10]:
(<matplotlib.text.Text at 0x10efc5438>,
<matplotlib.legend.Legend at 0x10f4379e8>)

From the plot we see that the bias of model 2 is smaller than the bias of model 1.

In [12]:
print ("Bias of model 1:", bias_1)

Bias of model 1: 0.4994622029361333

In [14]:
print ("Bias of model 2:", bias_2)

Bias of model 2: 0.20764158671729008

In [17]:
print ("Variance of model 1:", variance_1)

Variance of model 1: 0.2502570759559479

In [19]:
print ("Variance of model 2:", variance_2)

Variance of model 2: 1.6858020395241051

In [20]:
print("model 1: E_out ≈ bias + variance:  %f ≈ %f + %f" % (expectation_Eout_1, bias_1, variance_1))
print("model 2: E_out ≈ bias + variance:  %f ≈ %f + %f" % (expectation_Eout_2, bias_2, variance_2))

model 1: E_out ≈ bias + variance:  0.749565 ≈ 0.499462 + 0.250257
model 2: E_out ≈ bias + variance:  1.885209 ≈ 0.207642 + 1.685802


### Learning curves¶

#### Goal¶

$$h(\vec x) \approx t(\vec x)$$ reached if $E_{out}$ is small (small test error)

Both conditions must be satisfied

• $E_{in}$ small training error (low bias, no underfitting)
• Small generalization error $|E_{out} -E_{in}|$ respectively $test error \approx trainings error$ (low variance, no overfitting)

How can we know that the model is adequate?

• Number of training data $m$ of $\mathcal{D}_{train}$
• Model complexity $\mathcal{H}$
• Signal-noise ration: strength of stochastic noise $\epsilon$ in respect to the target function $t$