# Data Splitting

## Train-Test
The training set has an optimistic bias, since it is used to choose a hypothesis that looks good on it. Hence, we require an unseen set, which carries no such bias.

Once a data set has been used in the learning/validation process, it is "contaminated": it acquires an optimistic (deceptive) bias, and the error calculated on that data set no longer has a tight generalization bound.

To simulate deployment, any data used for evaluation should be treated as if it did not exist at the time of modelling.

## Data Split Sets

| | Train | Development (Inner Validation) | Validation (Outer Validation) | Test (Holdout) |
|---|---|---|---|---|
| Recommended split % | 40 | 20 | 20 | 20 |
| In-Sample ("Seen" by model) | ✅ | ❌ | ❌ | ❌ |
| EDA ("Seen" by analyst) | ✅ | ❌ | ❌ | ❌ |
| Pre-Processing "learning" (Normalization, Standardization, …) | ✅ | ❌ | ❌ | ❌ |
| Feature Engineering "learning" (Selection, Transformation, …) | ✅ | ❌ | ❌ | ❌ |
| Underfit Evaluation | ✅ | ❌ | ❌ | ❌ |
| Model Tuning | ✅ | ❌ | ❌ | ❌ |
| Overfit Evaluation | ❌ | ✅ | ❌ | ❌ |
| Hyperparameter Tuning | ❌ | ✅ | ❌ | ❌ |
| Model Comparison/Selection | ❌ | ❌ | ✅ | ❌ |
| Performance Reporting | ❌ | ❌ | ❌ | ✅ |
| \(\hat f\) | \({\hat f}_\text{in}\) | \({\hat f}_\text{dev}\) | \({\hat f}_\text{val}\) | \({\hat f}_\text{test}\) |
| \(\hat f\) trained on | Train | Train | Train+Dev | Train+Dev+Val |
| \(E\) | \(E_\text{in}\) | \(E_\text{dev}\) | \(E_\text{val}\) | \(E_\text{test}\) |
| Error Names | Training error / In-sample error / Empirical error / Empirical risk | Development error | Validation error | \(\hat E_\text{out}\): expected error / prediction error / risk |
| \({\vert H \vert}_\text{set}\) | \(\infty\) | \(d_\text{vc}\) | \({\vert H \vert}_\text{val}\) | \(1\) |
| Comment | | | Used for "training" on the "finalist" set of hypotheses | Should not be used for any model decision making |
| Color Scheme Below | Green | Yellow | Orange | Red |
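
A minimal sketch of producing the four sets in the recommended 40/20/20/20 proportions with scikit-learn's `train_test_split`; the toy data and random seed are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset
X, y = np.arange(1000).reshape(-1, 1), np.arange(1000) % 2

# 40% train, 60% remainder
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.6, random_state=0)

# Split the remainder into dev / val / test (20% of the full set each)
X_dev, X_rest, y_dev, y_rest = train_test_split(X_rest, y_rest, test_size=2 / 3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_val), len(X_test))  # 400 200 200 200
```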

## Test-Size Tradeoff

| | Small test set | Large test set |
|---|---|---|
| Low model bias | ✅ | ❌ |
| Small generalization bound | ❌ | ✅ |
| Reliable \(\hat E_\text{out}\): small \(E_\text{out}(\hat f_\text{test})-E_\text{test}(\hat f_\text{test})\) | ❌ | ✅ |
| Tested model and final model are the same: small \(E_\text{out}(\hat f) - E_\text{out}(\hat f_\text{test})\) | ✅ | ❌ |
| Extreme case (performance reporting) | "With no certainty, the model is excellent" | "With high certainty, the model is crap" |
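
To make the "reliable \(\hat E_\text{out}\)" row concrete: for a single final hypothesis (\(\vert H \vert = 1\)), the Hoeffding bound gives \(E_\text{out} \le E_\text{test} + \sqrt{\ln(2/\delta)/(2N_\text{test})}\) with probability at least \(1-\delta\). A small sketch of how this margin shrinks with test-set size (the sizes and \(\delta\) below are illustrative):

```python
import math

def hoeffding_margin(n_test, delta=0.05):
    """Margin added to E_test to bound E_out with probability >= 1 - delta (single hypothesis)."""
    return math.sqrt(math.log(2 / delta) / (2 * n_test))

for n_test in (50, 200, 1000, 5000):
    print(f"N_test={n_test:>5}: margin ~ {hoeffding_margin(n_test):.3f}")
# Larger test sets give a tighter (more reliable) bound,
# but leave less data for training the tested model.
```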

## Usage

- Training Data
  - Get \(E_\text{in}\)
  - Overfit all models
  - Beat baseline model(s)
- Development Data
  - Get \(E_\text{dev}\)
  - Tune all models to generalize
  - Beat baseline model(s)
- Validation Data
  - Compare all models on \(E_\text{val}\)
  - Must beat baseline model(s)
  - Select best model \(\hat f_\text{val}^*\)
  - Get accuracy estimate of \(\hat f_\text{val}^*\) on test data: \(E_\text{test}\)
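
The usage steps above as a compact scikit-learn sketch; the dataset, models, and metric are illustrative stand-ins, not part of the original notes:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 40 / 20 / 20 / 20 split (see table above)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.6, random_state=0)
X_dev, X_rest, y_dev, y_rest = train_test_split(X_rest, y_rest, test_size=2 / 3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

candidates = {
    "shallow_forest": RandomForestClassifier(max_depth=3, random_state=0).fit(X_train, y_train),
    "deep_forest": RandomForestClassifier(max_depth=None, random_state=0).fit(X_train, y_train),
}

for name, model in candidates.items():
    e_in = 1 - accuracy_score(y_train, model.predict(X_train))   # E_in: overfit / beat-baseline check
    e_dev = 1 - accuracy_score(y_dev, model.predict(X_dev))      # E_dev: used while tuning to generalize
    print(f"{name}: E_in={e_in:.3f}, E_dev={e_dev:.3f}")

# Compare the finalists (and the baseline) on the outer validation set ...
best_name = max(candidates, key=lambda k: accuracy_score(y_val, candidates[k].predict(X_val)))

# ... then report once on the holdout test set
e_test = 1 - accuracy_score(y_test, candidates[best_name].predict(X_test))
print(f"selected {best_name}: E_test={e_test:.3f}")
```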

Single metric

- Use the RMS (root mean square) of the train and dev error estimates to compare models
- The harmonic mean is not applicable, as it gives more weight to the smaller value
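
A quick numeric illustration of why RMS (and not the harmonic mean) works as the single metric; the error values are made up:

```python
import math

def rms(e_train, e_dev):
    """Root mean square: dominated by the LARGER of the two errors."""
    return math.sqrt((e_train ** 2 + e_dev ** 2) / 2)

def harmonic_mean(e_train, e_dev):
    """Dominated by the SMALLER error, so it would hide a bad dev score."""
    return 2 * e_train * e_dev / (e_train + e_dev)

print(rms(0.05, 0.30))            # ~0.215 -> penalizes the poor dev error
print(harmonic_mean(0.05, 0.30))  # ~0.086 -> hides it
```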

## Sampling Types

Repeatedly drawing samples from a training set and refitting the model of interest on each sample, in order to obtain additional information about the fitted model.

Hence, these methods help address the main issue of simple validation: results can be highly variable, depending on which observations are included in the training set and which are in the validation set.

| Sampling | Replacement | Comment | Better for identifying uncertainty in |
|---|---|---|---|
| Bootstrapping | With replacement | Better, as we can have a large number of repetitions of folds | Model parameters |
| Cross Validation | Without replacement | | Model accuracy |
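
A small sketch contrasting the two resampling schemes with scikit-learn (toy data; `resample` and `KFold` are the relevant utilities):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import resample

X = np.arange(10)

# Bootstrapping: draw n observations WITH replacement; repeat as many times as desired
boot = resample(X, replace=True, n_samples=len(X), random_state=0)
print("bootstrap sample:", boot)  # duplicates allowed; omitted points form the "out-of-bag" set

# Cross validation: partition WITHOUT replacement; each point lands in exactly one test fold
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("train:", train_idx, "test:", test_idx)
```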

## Cross Validation Types

| Type | Purpose | Comment |
|---|---|---|
| Regular \(k\)-fold | Obtain uncertainty of evaluation estimates | Higher \(k\) recommended for small datasets |
| Leave-One-Out | For very small datasets (\(n < 20\)) | \(k=n\) |
| Shuffled | | |
| Random Permutation | | |
| Stratified | Ensures that train, validation & test sets have the same distribution | |
| Stratified Shuffle | | |
| Grouped | | |
| Grouped - Leave One Group Out | | |
| Grouped with Random Permutation | | |
| Walk-Forward Expanding Window | | |
| Walk-Forward Rolling Window | | |
| Blocking | | |
| Purging | Remove training observations whose labels overlap in time with test labels | |
| Purging & Embargo | Prevent data leakage due to serial correlation: \(x_{\text{train}_{-1}} \approx x_{\text{test}_{0}}\), \(y_{\text{train}_{-1}} \approx y_{\text{test}_{0}}\) | |
| CPCV (Combinatorial Purged) | | |
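
Several of the rows above map directly onto scikit-learn splitter classes; a sketch with toy labels and groups is below. Purging, embargo, and CPCV come from the financial-ML literature and are not implemented in scikit-learn, so they are not shown here.

```python
import numpy as np
from sklearn.model_selection import (
    GroupKFold,
    KFold,
    LeaveOneOut,
    ShuffleSplit,
    StratifiedKFold,
    TimeSeriesSplit,
)

X = np.arange(20).reshape(-1, 1)
y = np.tile([0, 1], 10)               # class labels, used by the stratified splitter
groups = np.repeat(np.arange(5), 4)   # e.g. patient / asset IDs, used by the grouped splitter

splitters = {
    "regular k-fold": KFold(n_splits=4, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
    "random permutation": ShuffleSplit(n_splits=4, test_size=0.25, random_state=0),
    "stratified": StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    "grouped": GroupKFold(n_splits=4),
    "walk-forward expanding window": TimeSeriesSplit(n_splits=4),
}

for name, cv in splitters.items():
    train_idx, test_idx = next(iter(cv.split(X, y, groups)))
    print(f"{name:32s} first test fold: {test_idx}")
```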

## Bootstrapping Types

| Type | Comment |
|---|---|
| Random sampling with replacement | IID |
| ARIMA Bootstrap | Parametric |
| Moving Block Bootstrap | Non-parametric |
| Circular Block Bootstrap | Non-parametric |
| Stationary Bootstrap | Non-parametric |
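
A rough numpy sketch of the moving-block idea (the block length and series are illustrative): contiguous blocks are resampled with replacement, so short-range serial correlation inside each block is preserved. The circular and stationary variants differ only in how blocks wrap around the end of the series or how block lengths are drawn.

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.standard_normal(100)   # stand-in for a real time series

def moving_block_bootstrap(x, block_len=10, rng=rng):
    """Resample overlapping contiguous blocks (with replacement) until the original length is reached."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)  # block start positions
    blocks = [x[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]

resampled = moving_block_bootstrap(series)
print(resampled[:10])
```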

## Validation Methods

Make sure to shuffle all splits for cross-sectional data.

| Type | Cross-Sectional | Time Series | Comment |
|---|---|---|---|
| Holdout | | | |
| \(k\)-Fold | | | 1. Split dataset into \(k\) subsets 2. Train model on \(k-1\) subsets 3. Evaluate performance on the remaining subset 4. Summarize stats over all iterations |
| Repeated \(k\)-Fold | ✅ | | Repeat \(k\)-fold with different splits and random seeds |
| Nested \(k\)-Fold | | | |
| Nested Repeated \(k\)-Fold | ✅ | | |
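
A minimal nested \(k\)-fold sketch with scikit-learn: the inner loop tunes hyperparameters, the outer loop estimates the performance of the whole tuning procedure (the model and grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=0)   # hyperparameter tuning
outer = KFold(n_splits=5, shuffle=True, random_state=0)   # performance estimate

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner)
scores = cross_val_score(search, X, y, cv=outer)

print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Repeating this with different random seeds gives the repeated variants in the table.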

## Decision Parameter \(k\)

There is a tradeoff:

| | Small \(k\) | Large \(k\) |
|---|---|---|
| Train Size | Small | Large |
| Test Size | Large | Small |
| Bias | High | Low |
| Variance | Low | High |

Usually \(k\) is taken as:

- Large dataset: 4
- Small dataset: 10
- Tiny dataset: \(k=n\), i.e. LOOCV (Leave-One-Out CV)

## Data Leakage

Cases where some information from the training set has "leaked" into the validation/test set. The estimated performance is therefore likely to be optimistic.

Due to data leakage, a model trained for \(y_t = f(x_j)\) is more likely to be "luckily" accurate, even if \(x_j\) is irrelevant.

Causes:
- Perform feature selection using the whole dataset
- Perform dimensionality reduction using the whole dataset
- Perform parameter selection using the whole dataset
- Perform model or architecture search using the whole dataset
- Report the performance obtained on the validation set that was used to decide when to stop training (in deep learning)
- For a given patient, put some of their visits in the training set and some in the validation set
- For a given 3D medical image, put some 2D slices in the training set and some in the validation set
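
A small sketch of the first cause (feature selection on the whole dataset) and how a pipeline avoids it; the data here is pure noise, so any accuracy well above 0.5 is leakage, not skill:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))   # pure noise features
y = rng.integers(0, 2, size=100)       # random labels -> true accuracy ~ 0.5

# Leaky: feature selection "sees" all labels before cross validation
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Leak-free: selection is refit inside each training fold via a pipeline
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy: {leaky:.2f}  vs  leak-free: {clean:.2f}")  # the leaky estimate is optimistic
```

For the patient-visit and 2D-slice cases, grouped splitters (e.g. `GroupKFold` keyed on patient or image ID) keep all rows from one subject on the same side of the split.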