
Data Splitting

Train-Test

The training set has an optimistic bias, since it is used to choose a hypothesis that looks good on it. Hence, we require an unseen set, which is not biased.

Once a data set has been used in the learning/validation process, it is “contaminated” – it obtains an optimistic (deceptive) bias, and the error calculated on that data set no longer has a tight generalization bound.

To simulate deployment, any data used for evaluation should be treated as if it does not exist at the time of modelling

Data Split Sets

| | Train | Development (Eyeball) | Development (Black Box) | Validation (Inner) | Validation (Outer) | Test (Holdout) |
|---|---|---|---|---|---|---|
| Recommended split % | 20 | 5 | 5 | 20 | 20 | 20 |
| EDA ('Seen' by analyst) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| In-Sample ('Seen' by model) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Pre-Processing 'learning' (Normalization, Standardization, …) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Feature Selection | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Causal Discovery | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Feature Engineering 'learning' | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Error Analysis (Inspection) | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Model Tuning | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Underfit Evaluation | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Overfit Evaluation | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| Hyperparameter Tuning | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| Model Selection | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Model Evaluation (Performance Reporting) | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| \(\hat f\) | \({\hat f}_\text{in}\) | \({\hat f}_{\text{dev}_e}\) | \({\hat f}_{\text{dev}_b}\) | \({\hat f}_{\text{val}_i}\) | \({\hat f}_{\text{val}_o}\) | \({\hat f}_\text{test}\) |
| \(\hat f\) trained on | Train | Train | Until dev_e | Until dev_b | Until val_i | Until val_o |
| \(E\) | \(E_\text{in}\) | \(E_{\text{dev}_e}\) | \(E_{\text{dev}_b}\) | \(E_{\text{val}_i}\) | \(E_{\text{val}_o}\) | \(E_\text{test}\) |
| Error Names | Training Error / In-Sample Error / Empirical Error / Empirical Risk | Eyeball Development Error | Black Box Development Error | Validation Error | Validation Error | \(\hat E_\text{out}\) / Expected Error / Prediction Error / Risk |
| No of \(\hat f\) | Any | Any | Any | Any | Low (usually < 10?) | \(1\) |
| \({\vert H \vert}_\text{set}\) | \(\infty\) | \(\infty\) | \(d_\text{vc}\) | \({\vert H \vert}_{\text{val}_i}\) | \({\vert H \vert}_{\text{val}_o}\) | \(1\) |
| Comment | | | | | Used for “training” on “finalist” set of hypotheses | Should not be used for any model decision-making |
| Color Scheme Below | Green | Green | Green | Yellow | Orange | Red |

"\(\hat f\) trained on" implies that data should be split amongst to be used for - 60: Model fitting - 20: Model confidence interval generation (if required), else use this also for model fitting - 20: Model calibration - Confidence interval calibration - Classification proportion calibration

\[ \begin{aligned} \mathbb{E}[E_\text{test}] &= E_\text{out} \\ \text{var}[E_\text{test}] &= \dfrac{\sigma^2_{u}}{n_\text{test}} \\ \end{aligned} \]
\[ E_\text{out} \le E_\text{set} + O \left( \sqrt{\dfrac{\ln {\vert H \vert}_\text{set}}{n_\text{set}}} \right) \]
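As a rough illustration of these two results, the sketch below estimates \(E_\text{test}\) and its standard error from per-example losses, and evaluates a Hoeffding-style penalty for the \(O\left(\sqrt{\ln {\vert H \vert}_\text{set} / n_\text{set}}\right)\) term; the exact constant and confidence level \(\delta\) are assumptions made for illustration, not part of the bound as stated above.

```python
import numpy as np

def test_error_summary(losses):
    """Point estimate of E_test and its standard error sigma_u / sqrt(n_test)."""
    losses = np.asarray(losses, dtype=float)
    return losses.mean(), losses.std(ddof=1) / np.sqrt(losses.size)

def hoeffding_penalty(h_size, n, delta=0.05):
    """Illustrative form of the O(sqrt(ln|H|/n)) term for a finite hypothesis set,
    assuming losses bounded in [0, 1] (Hoeffding + union bound at confidence 1 - delta)."""
    return np.sqrt(np.log(2 * h_size / delta) / (2 * n))

# toy usage: 0/1 losses on 500 held-out points, 10 "finalist" hypotheses
rng = np.random.default_rng(0)
losses = rng.integers(0, 2, size=500)
print(test_error_summary(losses))
print(hoeffding_penalty(h_size=10, n=500))
```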

Test-Size Tradeoff

\[ E_\text{out}(\hat f) \underbrace{\approx}_\mathclap{n^*_\text{test} \downarrow} E_\text{out}(\hat f_\text{test}) \underbrace{\approx}_\mathclap{n^*_\text{test} \uparrow} E_\text{test}(\hat f_\text{test}) \]
| | Small \(n_\text{test}\) | Large \(n_\text{test}\) |
|---|---|---|
| Low model bias | ✅ | ❌ |
| Small generalization bound | ❌ | ✅ |
| Reliable \(\hat E_\text{out}\): small \(E_\text{out}(\hat f_\text{test})-E_\text{test}(\hat f_\text{test})\) | ❌ | ✅ |
| Tested model and final model are the same: small \(E_\text{out}(\hat f) - E_\text{out}(\hat f_\text{test})\) | ✅ | ❌ |
| Extreme case (model performance reporting) | “with no certainty, the model is excellent” | “with high certainty, the model is crap” |


Usage

  1. Split data
  2. Competition
    1. Create a self-hosted competition ('Kaggle' equivalent)
  3. Overfit to single train sample
  4. Overfit to entire dataset
    • Beat baseline model(s)
  5. Tune to generalize to dev set
    • Beat baseline model(s)
  6. Tune hyperparameters on inner validation set
  7. Compare all models on \(E_{\text{val}_o}\) on the outer validation set
    • Must beat baseline model(s)
  8. Select best model \(\hat f_{\text{val}_o}^*\)
  9. Get accuracy estimate of \(\hat f_{\text{val}_o}^*\) on test data: \(E_\text{test}\)
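A minimal sketch of steps 3–4 (sanity checks that the pipeline can overfit before worrying about generalization), assuming a toy regression setup with scikit-learn; the model, sample sizes, and thresholds are placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 5)), rng.normal(size=500)  # stand-in for the real train split

# Step 3: overfit a single (tiny) train sample -- error should be ~0,
# otherwise there is likely a bug in the features, labels, or training loop
X_tiny, y_tiny = X_train[:5], y_train[:5]
tiny_model = DecisionTreeRegressor().fit(X_tiny, y_tiny)
assert mean_squared_error(y_tiny, tiny_model.predict(X_tiny)) < 1e-8, "pipeline cannot even memorize 5 points"

# Step 4: overfit the entire train set and check we beat a trivial baseline
full_model = DecisionTreeRegressor().fit(X_train, y_train)
baseline_mse = mean_squared_error(y_train, np.full_like(y_train, y_train.mean()))
model_mse = mean_squared_error(y_train, full_model.predict(X_train))
print(f"baseline={baseline_mse:.3f}, model={model_mse:.3f}")
```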

Single metric

  • Use the RMS (Root Mean Square) of the train and dev error estimates to compare models with a single number
  • Harmonic mean is not applicable, as it gives more weight to the smaller value; the RMS is dominated by the larger of the two errors, so a model cannot look good by excelling on only one set
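A minimal sketch of such a combined score; the function names and toy error values are made up for illustration:

```python
import numpy as np

def rms_score(train_error, dev_error):
    """Root mean square of the two errors: dominated by the worse (larger) one."""
    return np.sqrt((train_error**2 + dev_error**2) / 2)

def harmonic_mean(train_error, dev_error):
    """Dominated by the better (smaller) one -- why it is NOT suitable here."""
    return 2 / (1 / train_error + 1 / dev_error)

# model A: balanced errors; model B: great on train, poor on dev
print(rms_score(0.10, 0.12), rms_score(0.01, 0.20))          # ~0.110 vs ~0.142 -> prefers A
print(harmonic_mean(0.10, 0.12), harmonic_mean(0.01, 0.20))  # ~0.109 vs ~0.019 -> misleadingly prefers B
```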


Sampling Types

Repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.

Hence, these help address the main issue with a single validation split: results can be highly variable, depending on which observations are included in the training set and which are in the validation set.

| | Bootstrapping | Cross Validation |
|---|---|---|
| Sampling | w/ Replacement | w/o Replacement |
| Estimate uncertainty in model parameters | ✅ | ❌ |
| Estimate expected model evaluation metric | ✅ | ❌ |
| Estimate model stability: standard error in model evaluation metric | ✅ | ❌ |
| Model Tuning | ✅ (check if change caused statistically-significant improvement) | ✅ |
| Hyperparameter Tuning | ✅ (check if change caused statistically-significant improvement) | ✅ |
| Model Selection | ❌ | ✅ |
| Advantage | Large repetitions of folds: no assumptions for standard error estimation | |
| Comment | The resulting distribution will give the sampling distribution of the evaluation metric | |
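A minimal sketch of using the bootstrap to approximate the sampling distribution (and hence the standard error) of an evaluation metric, assuming a toy linear-regression problem; scoring each resample on its out-of-bag rows is a design choice made here, not something prescribed by the table above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

n, B = len(y), 500
scores = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)          # sample row indices with replacement
    model = LinearRegression().fit(X[idx], y[idx])
    oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag rows for a less optimistic score
    scores[b] = mean_absolute_error(y[oob], model.predict(X[oob]))

print(f"MAE ~ {scores.mean():.3f} +/- {scores.std(ddof=1):.3f} (bootstrap standard error)")
```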

Cross Validation Types

| Type | Purpose | Comment |
|---|---|---|
| Regular \(k\)-Fold | Obtain uncertainty of evaluation estimates | Higher \(k\) recommended for small datasets |
| Leave-One-Out | For very small datasets (\(n < 20\)) | \(k=n\) |
| Shuffled | | |
| Random Permutation | | |
| Stratified | Ensures that Train, Validation & Test sets have the same distribution | |
| Stratified Shuffle | | |
| Grouped | | |
| Grouped - Leave One Group Out | | |
| Grouped with Random Permutation | | |
| Walk-Forward Expanding Window | | |
| Walk-Forward Rolling Window | | |
| Blocking | | |
| Purging | Remove train observations whose labels overlap in time with test labels | |
| Purging & Embargo | Prevent data leakage due to serial correlation | \(x_{\text{train}_{-1}} \approx x_{\text{test}_{0}}\), \(y_{\text{train}_{-1}} \approx y_{\text{test}_{0}}\) |
| CPCV (Combinatorial Purged) | | |
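scikit-learn does not ship purging or CPCV, but a walk-forward split with an embargo-like gap can be sketched with `TimeSeriesSplit`; treating the `gap` argument as a crude stand-in for purging/embargo is an assumption made here for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)  # 24 time-ordered observations

# 4 walk-forward (expanding-window) splits, with a 2-observation gap
# dropped between each training window and its test window
tscv = TimeSeriesSplit(n_splits=4, test_size=4, gap=2)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train ends at {train_idx[-1]}, test = {test_idx}")
```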

Bootstrapping Types

| Type | Comment | Advantage | Disadvantage |
|---|---|---|---|
| Random sampling with replacement | IID | | |
| ARIMA Bootstrap | Parametric | | |
| Moving Block Bootstrap | Non-parametric | | |
| Circular Block Bootstrap | Non-parametric | | |
| Stationary Bootstrap | Non-parametric | | |
| Test-Set Bootstrap | Only bootstrap the out-of-sample set (dev, val, test) | No refitting: great for Deep Learning | Large out-of-sample size required for good bootstrapping |
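A minimal sketch of the test-set bootstrap: the model is fit once and only its held-out predictions are resampled; the helper name, metric, and 95% interval are illustrative choices:

```python
import numpy as np

def test_set_bootstrap(y_true, y_pred, metric, n_boot=2000, seed=0):
    """Bootstrap the out-of-sample rows only -- no model refitting."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = np.array([
        metric(y_true[idx], y_pred[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ])
    return stats.mean(), np.percentile(stats, [2.5, 97.5])  # point estimate + 95% CI

# toy usage with mean absolute error
mae = lambda a, b: np.mean(np.abs(a - b))
rng = np.random.default_rng(1)
y_true = rng.normal(size=400)
y_pred = y_true + rng.normal(scale=0.3, size=400)  # stand-in for a fitted model's test predictions
print(test_set_bootstrap(y_true, y_pred, mae))
```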

Validation Methods

| Type | Cross-Sectional | Time Series | Comment |
|---|---|---|---|
| Holdout | train_test_split | train_test_split | |
| \(k\)-Fold | k_fold_cross_validation | k_fold_cross_validation | 1. Split dataset into \(k\) subsets; 2. Train model on \((k-1)\) subsets; 3. Evaluate performance on the remaining \(1\) subset; 4. Summary stats of all iterations |
| Repeated \(k\)-Fold | repeated_k_fold_cross_validation | ❌ | Repeat \(k\)-fold with different splits and random seed |
| Nested \(k\)-Fold | nested_k_fold_cross_validation | nested_k_fold_cross_validation | |
| Nested Repeated \(k\)-Fold | nested_repeated_k_fold_cross_validation | ❌ | |
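A minimal sketch of nested \(k\)-fold for cross-sectional data with scikit-learn, where the inner loop tunes hyperparameters and the outer loop estimates the performance of the whole tuning procedure; the estimator, grid, and fold counts are placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=0)   # hyperparameter tuning
outer = KFold(n_splits=5, shuffle=True, random_state=1)   # performance estimation

tuned = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=inner, scoring="neg_mean_absolute_error")
scores = cross_val_score(tuned, X, y, cv=outer, scoring="neg_mean_absolute_error")

print(f"nested CV MAE: {-scores.mean():.2f} +/- {scores.std(ddof=1):.2f}")
```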

For
- cross-sectional data
  - make sure to shuffle all splits
- time-series data, always add
  - purging
  - embargo
  - step size/gap between splits
    - estimates of error/loss for nearby splits will be correlated, so no point in estimating them
    - larger step size \(\implies\) fewer splits \(\implies\) saves time
    - always take step size \(>1\), as it is pointless to have step size \(= 1\)

Decision Parameter \(k\)

There is a tradeoff

| | Small \(k\) | Large \(k\) |
|---|---|---|
| Train Size | Small | Large |
| Test Size | Large | Small |
| Bias | High | Low |
| Variance | Low | High |

Usually \(k\) is taken

  • Large dataset: 4
  • Small dataset: 10
  • Tiny dataset: \(k=n\), i.e. LOOCV (Leave-One-Out CV)


Data Leakage

Cases where some information from the training set has “leaked” into the validation/test set. The estimated performance is then likely to be optimistic.

Due to data leakage, a model trained for \(y_t = f(x_j)\) is more likely to be ‘luckily’ accurate, even if \(x_j\) is irrelevant.

Causes

  • Perform feature selection using the whole dataset
  • Perform dimensionality reduction using the whole dataset
  • Perform parameter selection using the whole dataset
  • Perform model or architecture search using the whole dataset
  • Report the performance obtained on the validation set that was used to decide when to stop training (in deep learning)
  • For a given patient, put some of their visits in the training set and some in the validation set
  • For a given 3D medical image, put some 2D slices in the training set and some in the validation set
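As a sketch of avoiding the first cause above, wrapping feature selection in a scikit-learn `Pipeline` ensures it is re-fit inside every cross-validation training fold instead of on the whole dataset; the selector and classifier are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=500, n_informative=5, random_state=0)

# Leaky version (do NOT do this): selecting features on all of X before CV
# lets information from the validation folds influence the selected features.
# Safe version: selection is refit inside every training fold.
safe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(safe, X, y, cv=5).mean())
```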
