Validation

Learning objectives

  • Validation data
  • Model Selection
  • Cross Validation

Validation data

Open question:

  • Which ``model complexity'' is adequate, resp. how large should the regularization hyperparameter $\lambda$ be?

Non-training (out-of-sample) data are used to approximate the out-of-sample error $E_{out}$:

$$ \mathbb{E}[loss(h({\bf x}), y)] = E_{out}(h) $$

Validation data

Split the $m$ labeled data $\mathcal{D}$ in

  • Training data $\mathcal{D}_{train}$
  • $m_{val}$ validation data $\mathcal{D}_{val}$
$$ \mathcal{D}_{val} = \{ ({\bf x}^{(1)}, y^{(1)}), ({\bf x}^{(2)}, y^{(2)}), \ldots, ({\bf x}^{(m_{val})}, y^{(m_{val})}) \} $$

This leaves $m_{train} = m - m_{val}$ examples for training the parameters.
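As an illustration, such a split could look like the following Python/NumPy sketch (the function name, the 20% validation fraction, and the fixed seed are illustrative choices, not prescribed here):

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Randomly split the m labeled examples into training and validation parts."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                # random order of the m indices
    m_val = int(round(val_fraction * m))     # m_val examples go into D_val
    val_idx, train_idx = perm[:m_val], perm[m_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```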

Validation error

$$ E_{val}(h) = \frac{1}{m_{val}} \sum_{i=1}^{m_{val}} loss(h({\bf x}^{(i)}), y^{(i)}) $$

$$ \mathbb{E} \left[ E_{val}(h)\right] = \frac{1}{m_{val}} \sum_{i=1}^{m_{val}} \mathbb{E}\left[ loss(h({\bf x}^{(i)}), y^{(i)})\right] = E_{out}(h) $$
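As a minimal sketch, $E_{val}$ could be computed as follows (the hypothesis h and the squared-error loss are placeholders for whatever model and loss function are actually used):

```python
import numpy as np

def validation_error(h, X_val, y_val, loss=lambda y_hat, y: (y_hat - y) ** 2):
    """Average loss of hypothesis h over the m_val validation examples."""
    return np.mean([loss(h(x), y) for x, y in zip(X_val, y_val)])
```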

Reliability of the validation error for estimating $E_{out}$

With the abbreviations

  • $\mathcal{L}^{(i)} = loss(h({\bf x}^{(i)}), y^{(i)})$
  • $var[loss(h({\bf x}), y)] = \sigma^2$
$$ var\left[E_{val}(h)\right] = \frac{1}{m_{val}^2} \sum_{i=1}^{m_{val}} \sum_{j=1}^{m_{val}} \mathbb{E}\left[ \left( \mathcal{L}^{(i)} - \mathbb{E} \left[ E_{val}(h)\right] \right) \left( \mathcal{L}^{(j)} - \mathbb{E} \left[ E_{val}(h)\right] \right) \right] $$

Since the validation data are i.i.d., the losses are uncorrelated $\rightarrow$
the covariances vanish for $i \neq j$ (expressed with the Kronecker delta $\delta_{ij}$):

\begin{align*} var\left[E_{val}(h)\right] & = \frac{1}{m_{val}^2} \sum_{i=1}^{m_{val}} \sum_{j=1}^{m_{val}} \mathbb{E}\left[ \left( \mathcal{L}^{(i)} - \mathbb{E} \left[ E_{val}(h)\right] \right) \left( \mathcal{L}^{(j)} - \mathbb{E} \left[ E_{val}(h)\right] \right) \right] \delta_{ij} \\ & = \frac{1}{m_{val}^2} \sum_{i=1}^{m_{val}} \mathbb{E}\left[ \left( \mathcal{L}^{(i)} - \mathbb{E} \left[ E_{val}(h)\right] \right)^2 \right] \\ & = \frac{1}{m_{val}^2} \sum_{i=1}^{m_{val}} \sigma^2 = \frac{\sigma^2}{m_{val}} \end{align*}
 
$$ E_{val}(h) = E_{out}(h) \pm \mathcal O(\frac{1}{\sqrt{m_{val}}}) $$
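A small simulation can illustrate this $1/\sqrt{m_{val}}$ behaviour: the empirical spread of $E_{val}$ over many random validation sets matches $\sigma/\sqrt{m_{val}}$ (the loss distribution and the sizes below are arbitrary choices for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                  # standard deviation of a single loss value

for m_val in (10, 100, 1000):
    # 10000 independent validation sets of size m_val; E_val = mean loss per set
    losses = rng.normal(loc=0.5, scale=sigma, size=(10_000, m_val))
    E_val = losses.mean(axis=1)
    print(m_val, E_val.std(), sigma / np.sqrt(m_val))   # empirical vs. theoretical spread
```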

Data: Training vs. Validation

Tradeoff of using the data $\mathcal{D}$ for validation or training ($m = m_{train} + m_{val}$)

  • The more data used for training, the lower $E_{out}$ (see learning curves).
  • The more data used for validation, the better $E_{val}$ estimates $E_{out}$.

Rule of thumb: approx. 20% of the data for validation (if you have enough data).

Training/validation and retraining

  1. Training and validation

    • Train with $m_{train}$ data $\rightarrow h^-$
    • Validation with $m_{val}$ data to estimate $E_{out}$ and judge $h^-$ (model selection)
  2. Training with all $m$ data $\rightarrow h$

$$ E_{out}(h) \leq E_{out}(h^-) \leq E_{val}(h^-) + \mathcal O(\frac{1}{\sqrt{m_{val}}}) $$
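The two steps could be sketched as follows, reusing validation_error from above; fit stands for an arbitrary training routine and is a placeholder, not a fixed API:

```python
def train_validate_retrain(fit, X_train, y_train, X_val, y_val, X, y):
    """Step 1: train h^- on m_train examples and estimate E_out on D_val;
    step 2: retrain on all m examples."""
    h_minus = fit(X_train, y_train)                   # hypothesis h^-
    E_val = validation_error(h_minus, X_val, y_val)   # estimate of E_out(h^-)
    h = fit(X, y)                                     # final hypothesis h, trained on all m examples
    return h, E_val
```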

Model Selection

Validation vs. Test

  • Test data: used only for estimating the quality of the final model
  • Validation data are used indirectly for training:
    • tuning of the hyperparameters
    • comparison of different preprocessing methods
    • Model Selection

Models

``Different models'' can be:

  • different regularizations (e.g. hyperparameter $\lambda$)
  • different learning algorithms e.g. logistic regression, SVM, NN
  • different hypothesis models
    • linear vs. polynomial, degree of the polynomial
    • for neural networks: number of neurons in a layer, number of layers
    • for SVMs: which kernel etc.

Different Hypotheses $h_1^-, h_2^-, \dots, h_M^-$: $|\mathcal{H}_{val}| = M$

Selection of the hypothesis $h_{m^*}^-$ with the lowest value of $E_{val}$:

$$ E_{out}(h^-_{m^*}) \leq E_{val}(h^-_{m^*}) + \mathcal{O}\left(\sqrt{\frac{\ln M}{m_{val}}}\right) $$

Since $h^-_{m^*}$ was learned with only $m_{train}$ training examples, retraining the selected model with all $m$ examples lowers $E_{out}$ (see learning curves): $$ E_{out}(h_{m^*}) \leq E_{out}(h_{m^*}^-) \leq E_{val}(h_{m^*}^-) + \mathcal{O}\left(\sqrt{\frac{\ln M}{m_{val}}}\right) $$
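A sketch of this selection over $M$ candidates, here written for different regularization values $\lambda$ (fit is again a placeholder training routine, and validation_error is the function sketched earlier):

```python
def select_model(fit, lambdas, X_train, y_train, X_val, y_val):
    """Return the hyperparameter whose hypothesis h^- has the lowest E_val."""
    results = []
    for lam in lambdas:
        h_minus = fit(X_train, y_train, lam)                    # one candidate hypothesis per lambda
        results.append((validation_error(h_minus, X_val, y_val), lam))
    best_E_val, best_lam = min(results)                         # smallest validation error wins
    return best_lam, best_E_val
```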

Cross Validation

Cross validation

Goal: Model selection

Instead of training a model $h^-$ once with $\mathcal{D}_{train}$ and validating it with $\mathcal{D}_{val}$:

Repeat this procedure $v$ times: $v$-fold cross validation ($v$ iterations)

  • Split the data $\mathcal{D}$ into $v$ parts
  • Use $v-1$ parts for training and one part for validation
  • Repeat this training/validation $v$ times; in each iteration a different part is used for validation.
  • Average the validation error (over the iterations) to judge the model.

Extreme case leave-one-out: $v = m$, i.e. each validation part contains only a single example.
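A minimal sketch of $v$-fold cross validation (fit and validation_error are the placeholders used above; for $v = m$ this reduces to leave-one-out):

```python
import numpy as np

def cross_validation_error(fit, X, y, v=5, seed=0):
    """Average validation error over the v folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), v)    # split D into v roughly equal parts
    errors = []
    for k in range(v):
        train_idx = np.concatenate([folds[j] for j in range(v) if j != k])
        h = fit(X[train_idx], y[train_idx])                   # train on v-1 parts
        errors.append(validation_error(h, X[folds[k]], y[folds[k]]))  # validate on the k-th part
    return np.mean(errors)                                    # average over the iterations
```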

Guideline

Underfitting - High Bias

  • Use a model with more capacity
  • add meaningful features
  • lower the regularization parameter $\lambda$

Overfitting - High Variance

  • Use a model with less capacity
  • reduce the number of features (feature selection)
  • increase the regularization parameter $\lambda$
  • increase the number of training examples
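One common way to decide between the two cases is to compare training and validation error; the sketch below is only an illustration, and the thresholds are arbitrary assumptions:

```python
def diagnose(E_train, E_val, target_error=0.05, gap=0.05):
    """Very rough bias/variance diagnosis from training and validation error."""
    if E_train > target_error:      # training error itself is too high
        return "high bias (underfitting): more capacity, more features, smaller lambda"
    if E_val - E_train > gap:       # large gap between training and validation error
        return "high variance (overfitting): less capacity, fewer features, larger lambda, more data"
    return "model looks adequate"
```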