Regularization

Learning Objectives

  • Regularization (L2)
  • Capacity Control
  • Regularization for linear and logistic regression

Regularization

Intuition of regularization

Known:

  • If $\mathcal{H}$ is too complex, overfitting is very likely.
    • Polynomials of higher degree have higher model complexity than polynomials of lower degree.

Polynomial of degree 4: $$ h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 $$

  • The constraint $\theta_3 = \theta_4 = 0$ lowers the model complexity.

  • The constraint $|\theta_3| \leq \epsilon$ and $|\theta_4| \leq \epsilon$ with small $\epsilon$ $\rightarrow$ the model complexity increases only slightly.
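A minimal sketch of this intuition (assuming NumPy; the theta values are purely illustrative and not from the lecture): the degree-4 hypothesis is a linear model on the expanded features $1, x, x^2, x^3, x^4$, and setting $\theta_3 = \theta_4 = 0$ collapses it to a quadratic.

```python
import numpy as np

# Illustrative theta values (not from the lecture); theta[3] and theta[4]
# are the coefficients that the constraint pushes towards zero.
theta = np.array([1.0, 0.5, -0.2, 0.03, 0.01])

def h(theta, x):
    """Degree-4 polynomial hypothesis: theta_0 + theta_1*x + ... + theta_4*x^4."""
    powers = np.array([x ** j for j in range(len(theta))])
    return theta @ powers

x = 2.0
print(h(theta, x))                        # full degree-4 model
print(h(np.r_[theta[:3], 0.0, 0.0], x))   # with theta_3 = theta_4 = 0: a quadratic
```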

Cost function for regularization

Cost function for a polynomial of degree 4 with the constraint of small values for $\theta_3$ and $\theta_4$:

$$ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \theta_3^2 + \lambda \theta_4^2 $$

with a large hyperparameter $\lambda$: a large $\lambda$ forces the minimizer to keep $\theta_3$ and $\theta_4$ close to zero, so the fitted model behaves almost like a polynomial of degree 2.

Regularization for linear regression

  • Small values for $\theta_0, \theta_1, \ldots, \theta_n$ result in "simpler" hypotheses $\rightarrow$ less overfitting
  • Many features (e.g. the "housing prices" example) $\rightarrow$ some features carry little or no information $\rightarrow$ poor generalization.
    Regularization of the $\theta$-values suppresses such features.

Cost function with regularization

$$ J(\theta) = \frac{1}{m} \left[ \sum_{i=1}^{m} loss(h_\theta(x^{(i)}), y^{(i)}) + \frac{\lambda}{2} \sum_{j=1}^n \theta_j^2 \right] $$

with

  • $\lambda$: regularization hyperparameter - controls the complexity of the model.
    • large $\lambda$ $\rightarrow$ low complexity
    • small $\lambda$ $\rightarrow$ high complexity
  • typically: no regularization of $\theta_0$
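A minimal sketch of this cost function (assuming NumPy, a squared-error loss $loss(h, y) = \frac{1}{2}(h - y)^2$ for linear regression, and a design matrix X whose first column is all ones so that $\theta_0$ is excluded from the penalty; the function name and signature are illustrative):

```python
import numpy as np

def regularized_cost(theta, X, y, lambda_reg):
    """J(theta) = (1/m) * [ sum_i loss(h(x_i), y_i) + (lambda/2) * sum_{j>=1} theta_j^2 ]."""
    m = X.shape[0]
    residuals = X @ theta - y                 # h_theta(x_i) - y_i for a linear hypothesis
    data_term = 0.5 * np.sum(residuals ** 2)  # squared-error loss summed over all examples
    penalty = 0.5 * lambda_reg * np.sum(theta[1:] ** 2)  # theta_0 is not regularized
    return (data_term + penalty) / m
```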

Augmented Error

Instead of minimizing $J_{train}$ (the training error), we minimize an augmented error: $$ J_{aug}(\theta) = J_{train}(\theta) + \frac{\lambda}{m}\Omega(\theta) = J_{train}(\theta) + \text{overfit penalty} $$

Recap: gradient descent

Cost function $$ J(\theta) = \frac{1}{m} \left[ \sum_{i=1}^{m} loss(h_\theta(x^{(i)}), y^{(i)}) + \frac{\lambda}{2} \sum_{j=1}^n \theta_j^2 \right] $$

with the Update Rule

$$ \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$

Gradient descent with regularization for linear and logistic regression

for $j=0$ (no change) $$ \theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(\vec x^{(i)}) - y^{(i)}) x_0^{(i)} $$ for $j \neq 0$ $$ \theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] $$
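A minimal sketch of these updates in vectorized form (assuming NumPy, a linear hypothesis, and a design matrix whose first column is all ones, so that index 0 corresponds to the unregularized $\theta_0$; for logistic regression only the line computing the errors would change to use the sigmoid of $X\theta$):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lambda_reg):
    """One gradient-descent step with L2 regularization for linear regression."""
    m = X.shape[0]
    errors = X @ theta - y              # h_theta(x_i) - y_i
    grad = (X.T @ errors) / m           # (1/m) * sum_i (h_theta(x_i) - y_i) * x_j^(i)
    reg = (lambda_reg / m) * theta      # (lambda/m) * theta_j
    reg[0] = 0.0                        # j = 0: theta_0 is not regularized
    return theta - alpha * (grad + reg)
```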

Interpretation: Weight Decay

Rearranging the update rule for $j \neq 0$ yields $$ \theta_j \leftarrow \theta_j \left(1-\alpha \frac{\lambda}{m}\right) - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} $$

In comparison with the update rule without regularization, we see that $\theta_j$ is multiplied by a weight-decay factor: $$ \left(1-\alpha \frac{\lambda}{m}\right) < 1 $$
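For illustrative (assumed) values $\alpha = 0.1$, $\lambda = 1$ and $m = 100$, the factor is $1 - 0.1 \cdot \frac{1}{100} = 0.999$: in every iteration $\theta_j$ is first shrunk slightly towards zero ("weight decay") before the data-dependent gradient term is applied.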

Poison or medicine

(from [Abu])

Hypothesis set $\mathcal{H}$: polynomials of degree 4

$\lambda$ and noise

(from [Abu])

Performance of the uniform regularizer at different levels of stochastic noise $\sigma$. Both target and model are polynomials of order 15.

Exercise

Extend your "linear and logistic regression" implementation with regularization.

  1. Extend get_cost_function(loss, lambda_reg) with the regularization term (a possible skeleton is sketched after this exercise).
  2. Extend "gradient descent" and learn a logistic regression model with regularization.
    1. For debugging: Plot the progress (cost value over iterations)
    2. How do the theta values differ from those of the unregularized logistic regression? Does the decision boundary change? Is there a difference in the predicted class probabilities?
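A possible skeleton for step 1 (the signature get_cost_function(loss, lambda_reg) comes from the exercise; the closure style, the hypothesis argument and all other internals are only one possible design, not prescribed by the lecture):

```python
import numpy as np

def get_cost_function(loss, lambda_reg):
    """Return J(theta) as a function, built from a per-example loss(h, y) and lambda_reg."""
    def cost(theta, X, y, hypothesis):
        m = X.shape[0]
        data_term = np.sum(loss(hypothesis(theta, X), y))    # sum_i loss(h_theta(x_i), y_i)
        penalty = 0.5 * lambda_reg * np.sum(theta[1:] ** 2)  # (lambda/2) * sum_{j>=1} theta_j^2
        return (data_term + penalty) / m
    return cost
```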