
Data Splitting

Train-Test

The training set has an optimistic bias, since it is used to choose a hypothesis that looks good on it. Hence, we require an unseen set, which is not biased.

Once a data set has been used in the learning/validation process, it is β€œcontaminated” – it obtains an optimistic (deceptive) bias, and the error calculated on the data set no longer has the tight generalization bound.

To simulate deployment, any data used for evaluation should be treated as if it does not exist at the time of modelling

Data Split Sets

| | Train | Development (Inner Validation) | Validation (Outer Validation) | Test (Holdout) |
|---|---|---|---|---|
| Recommended split % | 40 | 20 | 20 | 20 |
| In-Sample ('Seen' by model) | ✅ | ❌ | ❌ | ❌ |
| EDA ('Seen' by analyst) | ✅ | ❌ | ❌ | ❌ |
| Pre-Processing "learning" (Normalization, Standardization, …) | ✅ | ❌ | ❌ | ❌ |
| Feature Engineering "learning" (Selection, Transformation, …) | ✅ | ❌ | ❌ | ❌ |
| Underfit Evaluation | ✅ | ❌ | ❌ | ❌ |
| Model Tuning | ✅ | ❌ | ❌ | ❌ |
| Overfit Evaluation | ❌ | ✅ | ❌ | ❌ |
| Hyperparameter Tuning | ❌ | ✅ | ❌ | ❌ |
| Model Comparison/Selection | ❌ | ❌ | ✅ | ❌ |
| Performance Reporting | ❌ | ❌ | ❌ | ✅ |
| \(\hat f\) | \({\hat f}_\text{in}\) | \({\hat f}_\text{dev}\) | \({\hat f}_\text{val}\) | \({\hat f}_\text{test}\) |
| \(\hat f\) trained on | Train | Train | Train+Dev | Train+Dev+Val |
| \(E\) | \(E_\text{in}\) | \(E_\text{dev}\) | \(E_\text{val}\) | \(E_\text{test}\) |
| Error Names | Training Error / In-Sample Error / Empirical Error / Empirical Risk | Development Error | Validation Error | \(\hat E_\text{out}\) / Expected Error / Prediction Error / Risk |
| \({\vert H \vert}_\text{set}\) | \(\infty\) | \(d_\text{vc}\) | \({\vert H \vert}_\text{val}\) | \(1\) |
| Comment | Used for "training" | | "Training" on "finalist" set of hypotheses | Should not be used for any model decision making |
| Color Scheme Below | Green | Yellow | Orange | Red |
\[ \begin{aligned} \mathbb{E}[E_\text{test}] &= E_\text{out} \\ \text{var}[E_\text{test}] &= \dfrac{\sigma^2_{u}}{n_\text{test}} \\ \end{aligned} \]
\[ E_\text{out} \le E_\text{set} + O \left( \sqrt{\dfrac{\ln {\vert H \vert}_\text{set}}{n_\text{set}}} \right) \]
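
The recommended 40/20/20/20 split can be produced by chaining splits. Below is a minimal sketch using scikit-learn's `train_test_split`; the dataset, variable names, and random seed are placeholders, not part of these notes.

```python
# A minimal sketch of the recommended 40/20/20/20 split.
# The dataset (X, y) and the random seed are illustrative placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 5), np.random.rand(1000)

# 40% train, then split the remaining 60% into dev/val/test of 20% each
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.6, random_state=0)
X_dev, X_rest, y_dev, y_rest = train_test_split(X_rest, y_rest, test_size=2 / 3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_val), len(X_test))  # 400 200 200 200
```

For time-series data, pass `shuffle=False` so the splits stay in chronological order; the default behaviour shuffles the rows.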

Test-Size Tradeoff

\[ E_\text{out}(\hat f) \underbrace{\approx}_\mathclap{n^*_\text{test} \downarrow} E_\text{out}(\hat f_\text{test}) \underbrace{\approx}_\mathclap{n^*_\text{test} \uparrow} E_\text{test}(\hat f_\text{test}) \]
| | Small Test Set | Large Test Set |
|---|---|---|
| Low Model Bias | ✅ | ❌ |
| Small Generalization Bound | ❌ | ✅ |
| Reliable \(\hat E_\text{out}\), i.e. small \(E_\text{out}(\hat f_\text{test})-E_\text{test}(\hat f_\text{test})\) | ❌ | ✅ |
| Tested model and final model are the same, i.e. small \(E_\text{out}(\hat f) - E_\text{out}(\hat f_\text{test})\) | ✅ | ❌ |
| Extreme case of model performance reporting | "with no certainty, the model is excellent" | "with high certainty, the model is crap" |


Usage

  1. Training data
     1. Get \(E_\text{in}\)
     2. Overfit all models
     3. Beat baseline model(s)
  2. Dev data
     1. Get \(E_\text{dev}\)
     2. Tune all models to generalize
     3. Beat baseline model(s)
  3. Validation data
     1. Compare all models on \(E_\text{val}\)
     2. Must beat baseline model(s)
     3. Select best model \(\hat f_\text{val}^*\)
  4. Test data
     1. Get accuracy estimate of \(\hat f_\text{val}^*\) on test data: \(E_\text{test}\) (see the sketch below)
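
A hedged sketch of this workflow, using an illustrative ridge-regression example; the data, models, metric, and baseline are assumptions for the demo, not part of the notes.

```python
# Sketch of the train -> dev -> val -> test workflow above.
# Data, models, and the MSE metric are illustrative choices.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1000)

# Pre-made 40/20/20/20 split (indices are illustrative)
X_train, X_dev, X_val, X_test = X[:400], X[400:600], X[600:800], X[800:]
y_train, y_dev, y_val, y_test = y[:400], y[400:600], y[600:800], y[800:]

def err(model, X_, y_):
    return mean_squared_error(y_, model.predict(X_))

baseline = DummyRegressor().fit(X_train, y_train)

# Steps 1-2: fit candidates on train (check E_in), tune them against E_dev
candidates = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    m = Ridge(alpha=alpha).fit(X_train, y_train)
    candidates[alpha] = m
    print(f"alpha={alpha}: E_in={err(m, X_train, y_train):.3f}, E_dev={err(m, X_dev, y_dev):.3f}")

# Step 3: compare the finalists (and the baseline) on E_val, select the best
best = min(candidates.values(), key=lambda m: err(m, X_val, y_val))
assert err(best, X_val, y_val) < err(baseline, X_val, y_val)  # must beat baseline

# Step 4: report performance once on the held-out test set
print("E_test:", err(best, X_test, y_test))
```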

Single metric

  • Use the RMS (root mean square) of the train and dev error estimates to compare models
  • The harmonic mean is not applicable, as it gives more weight to the smaller value
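
A small worked example of why RMS is preferred over the harmonic mean here; the error values below are invented purely for illustration.

```python
# Combining train and dev errors into a single metric.
# Error values are made up for the example.
import math

def rms(e_train, e_dev):
    return math.sqrt((e_train ** 2 + e_dev ** 2) / 2)

def harmonic(e_train, e_dev):
    return 2 / (1 / e_train + 1 / e_dev)

# Model A overfits (great train, poor dev); model B is balanced.
print(rms(0.05, 0.40), rms(0.20, 0.22))            # ~0.29 vs ~0.21 -> B wins
print(harmonic(0.05, 0.40), harmonic(0.20, 0.22))  # ~0.09 vs ~0.21 -> A "wins"
```

The harmonic mean is dragged toward the small training error and hides the poor dev performance, which is exactly why it is not used for this comparison.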


Sampling Types

Repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.

Hence, these help address the issue of simple validation: results can be highly variable, depending on which observations are included in the training set and which are in the validation set.

| Sampling | Comment | Better for identifying uncertainty in model ___ |
|---|---|---|
| Bootstrapping (w/ replacement) | Better, as we can have a large number of repetitions of folds | parameters |
| Cross Validation (w/o replacement) | | accuracy |
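
As a sketch of the bootstrapping row above, the snippet below resamples with replacement to estimate the uncertainty of a fitted slope; the synthetic data and the 1000 repetitions are illustrative choices.

```python
# Bootstrapping (sampling with replacement) to estimate uncertainty
# in a model parameter; data are synthetic, repetitions illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

slopes = []
for _ in range(1000):                      # many bootstrap repetitions
    idx = rng.integers(0, len(x), len(x))  # draw indices with replacement
    slope, _ = np.polyfit(x[idx], y[idx], deg=1)
    slopes.append(slope)

print(np.mean(slopes), np.std(slopes))     # parameter estimate and its spread
```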

Cross Validation Types

| Type | Purpose | Comment |
|---|---|---|
| Regular \(k\)-fold | Obtain uncertainty of evaluation estimates | Higher \(k\) recommended for small datasets |
| Leave-One-Out | | For very small datasets (\(n < 20\)); \(k=n\) |
| Shuffled | | |
| Random Permutation | | |
| Stratified | Ensures that Train, Validation & Test sets have the same distribution | |
| Stratified Shuffle | | |
| Grouped | | |
| Grouped - Leave One Group Out | | |
| Grouped with Random Permutation | | |
| Walk-Forward Expanding Window | | |
| Walk-Forward Rolling Window | | |
| Blocking | | |
| Purging | Remove train observations whose labels overlap in time with test labels | |
| Purging & Embargo | Prevent data leakage due to serial correlation | \(x_{\text{train}_{-1}} \approx x_{\text{test}_{0}}\), \(y_{\text{train}_{-1}} \approx y_{\text{test}_{0}}\) |
| CPCV (Combinatorial Purged) | | |
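
For the walk-forward rows, scikit-learn's `TimeSeriesSplit` gives an expanding window by default and a rolling window via `max_train_size`; purging, embargo, and CPCV typically require custom or third-party code. A small sketch, where the array and fold counts are illustrative:

```python
# Walk-forward splits with scikit-learn's TimeSeriesSplit.
# Default = expanding window; max_train_size = rolling window.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

expanding = TimeSeriesSplit(n_splits=3)
rolling = TimeSeriesSplit(n_splits=3, max_train_size=4)

for train_idx, test_idx in expanding.split(X):
    print("expanding", train_idx, test_idx)
for train_idx, test_idx in rolling.split(X):
    print("rolling  ", train_idx, test_idx)
```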

Bootstrapping Types

| Type | Comment |
|---|---|
| IID Bootstrap | Random sampling with replacement |
| ARIMA Bootstrap | Parametric |
| Moving Block Bootstrap | Non-parametric |
| Circular Block Bootstrap | Non-parametric |
| Stationary Bootstrap | Non-parametric |
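
A rough sketch of one of these, the moving block bootstrap, written directly with NumPy: overlapping blocks are drawn with replacement and concatenated, preserving short-range serial dependence. The helper name and block length are illustrative assumptions.

```python
# Moving block bootstrap sketch: resample overlapping blocks of a series
# with replacement and stitch them together. Block length is arbitrary.
import numpy as np

def moving_block_bootstrap(series, block_len=20, rng=None):
    rng = rng or np.random.default_rng()
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)  # block start indices
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]

rng = np.random.default_rng(0)
ts = np.cumsum(rng.normal(size=500))      # a random-walk series
resampled = moving_block_bootstrap(ts, block_len=25, rng=rng)
print(resampled[:5])
```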

Validation Methods

Make sure to shuffle all splits for cross-sectional data

| Type | Cross-Sectional | Time Series | Comment |
|---|---|---|---|
| Holdout | train_test_split | train_test_split | |
| \(k\)-Fold | k_fold_cross_validation | k_fold_cross_validation | 1. Split dataset into \(k\) subsets<br>2. Train model on \(k-1\) subsets<br>3. Evaluate performance on the remaining subset<br>4. Summary stats of all iterations |
| Repeated \(k\)-Fold | repeated_k_fold_cross_validation | ❌ | Repeat \(k\)-fold with different splits and random seeds |
| Nested \(k\)-Fold | nested_k_fold_cross_validation | nested_k_fold_cross_validation | |
| Nested Repeated \(k\)-Fold | nested_repeated_k_fold_cross_validation | ❌ | |
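
A sketch of nested \(k\)-fold CV with scikit-learn: the inner loop tunes hyperparameters, the outer loop estimates generalization error. The model, parameter grid, and fold counts are illustrative choices.

```python
# Nested k-fold CV: GridSearchCV (inner loop) wrapped in cross_val_score
# (outer loop). Model, grid, and fold counts are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                     cv=KFold(5, shuffle=True, random_state=0))
outer_scores = cross_val_score(inner, X, y, cv=KFold(5, shuffle=True, random_state=1))
print(outer_scores.mean(), outer_scores.std())
```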

Decision Parameter \(k\)

There is a tradeoff

| | Small \(k\) | Large \(k\) |
|---|---|---|
| Train Size | Small | Large |
| Test Size | Large | Small |
| Bias | High | Low |
| Variance | Low | High |

Usually \(k\) is taken as

  • Large dataset: \(k = 4\)
  • Small dataset: \(k = 10\)
  • Tiny dataset: \(k=n\), i.e. LOOCV (Leave-One-Out CV)
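
The snippet below just illustrates how these choices of \(k\) change the per-fold train/test sizes; the 20-sample array is a placeholder.

```python
# How k affects per-fold train/test sizes (20 samples as a placeholder).
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(-1, 1)
for k in (4, 10):
    train_idx, test_idx = next(iter(KFold(n_splits=k).split(X)))
    print(f"k={k}: train={len(train_idx)}, test={len(test_idx)}")

train_idx, test_idx = next(iter(LeaveOneOut().split(X)))  # k = n
print(f"LOOCV: train={len(train_idx)}, test={len(test_idx)}")
```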


Data Leakage

Cases where some information from the training set has "leaked" into the validation/test set. The estimated performance is then likely to be optimistic.

Due to data leakage, a model trained for \(y_t = f(x_j)\) is more likely to be 'luckily' accurate, even if \(x_j\) is irrelevant.

Causes

  • Perform feature selection using the whole dataset
  • Perform dimensionality reduction using the whole dataset
  • Perform parameter selection using the whole dataset
  • Perform model or architecture search using the whole dataset
  • Report the performance obtained on the validation set that was used to decide when to stop training (in deep learning)
  • For a given patient, put some of its visits in the training set and some in the validation set
  • For a given 3D medical image, put some 2D slices in the training set and some in the validation set
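
Several of the causes above (feature selection, pre-processing, or parameter selection on the whole dataset) can be avoided by fitting those steps inside each training fold only. A hedged sketch with a scikit-learn pipeline; the dataset and estimator choices are illustrative.

```python
# Avoiding leakage: wrap the pre-processing and feature-selection
# "learning" steps in a Pipeline so each CV split re-fits them on its
# training fold only, never on the whole dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Leaky: StandardScaler().fit(X) or SelectKBest().fit(X, y) on all rows.
# Safe: the pipeline below is fitted fresh on each training fold.
pipe = make_pipeline(StandardScaler(), SelectKBest(k=10), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())
```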