# Data Splitting

## Train-Test
The training set has an optimistic bias, since it is used to choose a hypothesis that looks good on it. Hence, we require an unseen set, which carries no such bias.

Once a data set has been used in the learning/validation process, it is “contaminated”: it acquires an optimistic (deceptive) bias, and the error calculated on that data set no longer has a tight generalization bound.

To simulate deployment, any data used for evaluation should be treated as if it did not exist at the time of modelling.
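A minimal sketch of this discipline, assuming scikit-learn and toy synthetic data (all names hypothetical): the holdout set is carved out before any analysis and only touched for the final report.

```python
# Minimal sketch: lock away a holdout set before any EDA or modelling.
# Assumes scikit-learn; the data here is synthetic and purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # toy features
y = X @ rng.normal(size=5) + rng.normal(0, 0.1, 1000)   # toy targets

# Split once, up front; the test set is treated as if it does not exist.
X_work, X_test, y_work, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# All EDA, feature engineering, and model decisions use X_work/y_work only;
# X_test/y_test are opened exactly once, for the final performance report.
```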
## Data Split Sets
| | Train | Development Eyeball | Development Black Box | Validation Inner | Validation Outer | Test (Holdout) |
|---|---|---|---|---|---|---|
Recommended split % | 20 | 5 | 5 | 20 | 20 | 20 |
EDA ('Seen' by analyst) | β | β | β | β | β | β |
In-Sample ('Seen' by model) | β | β | β | β | β | β |
Pre-Processing 'learning' (Normalization, Standardization, β¦) | β | β | β | β | β | β |
Feature Selection | β | β | β | β | β | β |
Causal Discovery | β | β | β | β | β | β |
Feature Engineering 'learning' | β | β | β | β | β | β |
Error Analysis (Inspection) | β | β | β | β | β | β |
Model Tuning | β | β | β | β | β | β |
Underfit Evaluation | β | β | β | β | β | β |
Overfit Evaluation | β | π‘ | β | β | β | β |
Hyperparameter Tuning | β | β | β | β | β | β |
Model Selection | β | β | β | β | β | β |
Model Evaluation (Performance Reporting) | β | β | β | β | β | β |
\(\hat f\) | \({\hat f}_\text{in}\) | \({\hat f}_{\text{dev}_e}\) | \({\hat f}_{\text{dev}_b}\) | \({\hat f}_{\text{val}_i}\) | \({\hat f}_{\text{val}_o}\) | \({\hat f}_\text{test}\) |
\(\hat f\) trained on | Train | Train | Until dev_e | Until dev_b | Until val_i | Until val_o |
\(E\) | \(E_\text{in}\) | \(E_{\text{dev}_e}\) | \(E_{\text{dev}_b}\) | \(E_{\text{val}_i}\) | \(E_{\text{val}_o}\) | \(E_\text{test}\) |
Error Names | Training Error / In-Sample Error / Empirical Error / Empirical Risk | Eyeball Development Error | Black Box Development Error | Validation Error | Validation Error | \(\hat E_\text{out}\) / Expected Error / Prediction Error / Risk |
No of \(\hat f\) | Any | Any | Any | Any | Low (Usually < 10?) | \(1\) |
\({\vert H \vert}_\text{set}\) | \(\infty\) | \(\infty\) | \(d_\text{vc}\) | \({\vert H \vert}_{\text{val}_i}\) | \({\vert H \vert}_{\text{val}_o}\) | \(1\) |
Comment | | | | | Used for “training” on the “finalist” set of hypotheses | Should not be used for any model decision making |
Color Scheme Below | Green | Green | Green | Yellow | Orange | Red |
"\(\hat f\) trained on" implies that data should be split amongst to be used for - 60: Model fitting - 20: Model confidence interval generation (if required), else use this also for model fitting - 20: Model calibration - Confidence interval calibration - Classification proportion calibration
## Test-Size Tradeoff
| | Small Test Set | Large Test Set |
|---|---|---|
Low Model Bias | ✅ | ❌ |
Small Generalization Bound | ❌ | ✅ |
Reliable \(\hat E_\text{out}\): small \(E_\text{out}(\hat f_\text{test})-E_\text{test}(\hat f_\text{test})\) | ❌ | ✅ |
Tested model and final model are the same: small \(E_\text{out}(\hat f) - E_\text{out}(\hat f_\text{test})\) | ✅ | ❌ |
Extreme case: model performance reporting | “with no certainty, the model is excellent” | “with high certainty, the model is crap” |
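One way to make this tradeoff concrete is a Hoeffding-style bound (not stated above, added here for reference; \(N_\text{test}\) is the test-set size and \(\delta\) the tolerated failure probability): with probability at least \(1-\delta\),

$$
E_\text{out}(\hat f_\text{test}) \le E_\text{test}(\hat f_\text{test}) + \sqrt{\frac{1}{2 N_\text{test}} \ln\frac{2}{\delta}}
$$

A larger \(N_\text{test}\) tightens the bound (a more certain estimate) but leaves less data for training (a weaker \(\hat f_\text{test}\)).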
## Usage
- Split data
- Competition
- Create a self-hosted competition ('Kaggle' equivalent)
- Overfit to single train sample
- Overfit to entire dataset
- Beat baseline model(s)
- Tune to generalize to dev set
- Beat baseline model(s)
- Tune hyperparameters on inner validation set
- Compare all models on \(E_{\text{val}_o}\) on the outer validation set
- Must beat baseline model(s)
- Select best model \(\hat f_{\text{val}_o}^*\)
- Get accuracy estimate of \(\hat f_{\text{val}_o}^*\) on test data: \(E_\text{test}\)
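A minimal sketch of the early sanity checks in the workflow above, assuming scikit-learn and hypothetical splits `X_train`, `y_train`, `X_dev`, `y_dev` (the model class is arbitrary):

```python
# Sanity checks: overfit a single sample, then beat a naive baseline on dev.
# Assumes scikit-learn and pre-made splits X_train, y_train, X_dev, y_dev.
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# 1. Overfit a single training sample: training error should be ~0,
#    otherwise the pipeline itself is likely broken.
probe = DecisionTreeRegressor().fit(X_train[:1], y_train[:1])
print("single-sample train MSE:", mean_squared_error(y_train[:1], probe.predict(X_train[:1])))

# 2. Fit on the full training set and check that it beats a trivial baseline
#    on the dev set before investing time in tuning.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = DecisionTreeRegressor().fit(X_train, y_train)
print("baseline dev MSE:", mean_squared_error(y_dev, baseline.predict(X_dev)))
print("model dev MSE:   ", mean_squared_error(y_dev, model.predict(X_dev)))
```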
Single metric
- Use the RMS (root mean square) of the train and dev error estimates to compare models
- The harmonic mean is not applicable, as it gives more weight to the smaller value
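A small numeric illustration of the point above, with made-up error values:

```python
# Compare RMS vs harmonic mean for an overfit model (hypothetical numbers):
# low train error, high dev error.
import numpy as np

e_train, e_dev = 0.02, 0.30

rms = np.sqrt((e_train**2 + e_dev**2) / 2)   # ~0.21, dominated by the worse (dev) error
harmonic = 2 / (1/e_train + 1/e_dev)         # ~0.04, dominated by the better (train) error

print(rms, harmonic)  # the RMS flags the overfitting; the harmonic mean hides it
```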
## Sampling Types
Repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
Hence, these methods help address a key issue with a single validation split: results can be highly variable, depending on which observations end up in the training set and which end up in the validation set.
| | Bootstrapping | Cross Validation |
|---|---|---|
Sampling | w/ Replacement | w/o Replacement |
Estimate uncertainty in model parameters | β | β |
Estimate expected model evaluation metric | β | β |
Estimate model stability: standard error in model evaluation metric | β | β |
Model Tuning | β (check if change caused statistically-significant improvement) | β |
Hyperparameter Tuning | β (check if change caused statistically-significant improvement) | β |
Model Selection | β | β |
Advantage | Large repetitions of folds: No assumptions for standard error estimation | |
Comment | The resulting distribution will give the sampling distribution of the evaluation metric |
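A sketch of the bootstrap with refitting, assuming scikit-learn and hypothetical arrays `X_train`, `y_train`, `X_val`, `y_val`: resampling the training data and refitting gives sampling distributions for both the parameters and the evaluation metric.

```python
# Bootstrap sketch: resample training rows with replacement, refit, and record
# parameters and a validation metric. Assumes scikit-learn and hypothetical
# arrays X_train, y_train, X_val, y_val.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample

coefs, errors = [], []
for seed in range(500):
    X_b, y_b = resample(X_train, y_train, random_state=seed)  # sample w/ replacement
    model = LinearRegression().fit(X_b, y_b)
    coefs.append(model.coef_)
    errors.append(mean_squared_error(y_val, model.predict(X_val)))

# Standard error of each parameter and of the evaluation metric.
print(np.std(coefs, axis=0), np.std(errors))
```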
## Cross Validation Types
| | Purpose | Comment |
|---|---|---|
Regular \(k\) fold | Obtain uncertainty of evaluation estimates | Higher \(k\) recommended for small datasets | |
Leave-One-Out | For very small datasets \(n < 20\) | \(k=n\) | |
Shuffled | |||
Random Permutation | |||
Stratified | Ensures that Train, Validation & Test sets have same distribution | ||
Stratified Shuffle | |||
Grouped | |||
Grouped - Leave One Group Out | |||
Grouped with Random Permutation | |||
Walk-Forward Expanding Window | |||
Walk-Forward Rolling Window | |||
Blocking | |||
Purging | Remove training observations whose labels overlap in time with the test labels | |
Purging & Embargo | Prevent data leakage due to serial correlation: \(x_{\text{train}_{-1}} \approx x_{\text{test}_{0}}\), \(y_{\text{train}_{-1}} \approx y_{\text{test}_{0}}\) | |
CPCV (Combinatorial Purged) |
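A sketch of walk-forward (expanding-window) splits with a gap between train and test folds, assuming scikit-learn and a hypothetical time-ordered array `X`; the gap plays a role similar to purging/embargo for serially correlated data.

```python
# Walk-forward expanding-window CV with a gap between train and test folds.
# Assumes scikit-learn and a time-ordered array X (hypothetical).
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5, gap=10)   # drop 10 observations between train and test
for train_idx, test_idx in tscv.split(X):
    print(f"train ends at {train_idx[-1]}, test starts at {test_idx[0]}")
```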
## Bootstrapping Types
| | Comment | Advantage | Disadvantage |
|---|---|---|---|
Random sampling with replacement | IID | |||
ARIMA Bootstrap | Parametric | |||
Moving Block Bootstrap | Non-parametric | |||
Circular Block Bootstrap | Non-parametric | |||
Stationary Bootstrap | Non-parametric | |||
Test-Set Bootstrap | Only bootstrap the out-of-sample set (dev, val, test) | No refitting: Great for Deep Learning | Large out-of-sample size required for good bootstrapping |
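A sketch of the test-set bootstrap row above, assuming hypothetical arrays `y_test` (labels) and `y_pred` (predictions from an already-trained model); only the out-of-sample predictions are resampled, so no refitting is needed.

```python
# Test-set bootstrap: resample (label, prediction) pairs with replacement to
# get an interval for the metric without refitting the model.
# Assumes scikit-learn and hypothetical arrays y_test, y_pred.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

scores = []
for seed in range(2000):
    y_t, y_p = resample(y_test, y_pred, random_state=seed)  # paired resampling
    scores.append(accuracy_score(y_t, y_p))

print(np.percentile(scores, [2.5, 97.5]))  # bootstrap 95% interval for accuracy
```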
## Validation Methods
Type | Cross-Sectional | Time Series | Comment |
---|---|---|---|
Holdout | |||
\(k\)-Fold | 1. Split the dataset into \(k\) subsets 2. Train the model on \(k-1\) subsets 3. Evaluate performance on the remaining subset 4. Summarize statistics over all \(k\) iterations | |
Repeated \(k\)-Fold | β | Repeat \(k\) fold with different splits and random seed | |
Nested \(k\)-Fold | |||
Nested Repeated \(k\)-Fold | β |
- For cross-sectional data, make sure to shuffle all splits
- For time-series data, always add
  - purging
  - embargo
  - a step size/gap between splits
    - error/loss estimates for nearby splits will be correlated, so there is no point in estimating them
    - a larger step size \(\implies\) fewer splits \(\implies\) saves time
    - always take a step size \(> 1\), as a step size of \(1\) is pointless
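A sketch of nested \(k\)-fold from the table above, assuming scikit-learn and hypothetical cross-sectional classification data `X`, `y` (hence the shuffled splits): the inner loop tunes hyperparameters, the outer loop reports performance.

```python
# Nested k-fold sketch: inner CV tunes hyperparameters, outer CV estimates
# performance. Assumes scikit-learn and hypothetical classification data X, y.
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

inner = KFold(n_splits=5, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner)
scores = cross_val_score(search, X, y, cv=outer)  # the search is refit inside every outer fold
print(scores.mean(), scores.std())
```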
## Decision Parameter \(k\)
There is a tradeoff:

| | Small \(k\) | Large \(k\) |
|---|---|---|
Train Size | Small | Large |
Test Size | Large | Small |
Bias | High | Low |
Variance | Low | High |
Usually \(k\) is taken as

- Large dataset: 4
- Small dataset: 10
- Tiny dataset: \(k=n\), i.e., LOOCV (Leave-One-Out CV)
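A small illustration of how \(k\) sets the fold sizes, assuming scikit-learn and a hypothetical sample count of \(n = 100\):

```python
# Fold sizes for k = 4, 10, and n (LOOCV). Assumes scikit-learn; the data is a
# dummy array, only the indices matter here.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

n = 100
X_dummy = np.zeros((n, 1))

for k in (4, 10, n):
    cv = LeaveOneOut() if k == n else KFold(n_splits=k)
    train_idx, test_idx = next(iter(cv.split(X_dummy)))
    print(f"k={k}: train fold size={len(train_idx)}, test fold size={len(test_idx)}")
```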
## Data Leakage
Cases where some information from the training set has “leaked” into the validation/test set, so the estimated performance is likely to be optimistic.

Due to data leakage, a model trained for \(y_t = f(x_j)\) is more likely to be “luckily” accurate, even if \(x_j\) is irrelevant.
Causes
- Perform feature selection using the whole dataset
- Perform dimensionality reduction using the whole dataset
- Perform parameter selection using the whole dataset
- Perform model or architecture search using the whole dataset
- Report the performance obtained on the validation set that was used to decide when to stop training (in deep learning)
- For a given patient, put some of their visits in the training set and some in the validation set
- For a given 3D medical image, put some 2D slices in the training set and some in the validation set
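A sketch of avoiding two of the leaks above, assuming scikit-learn and hypothetical arrays `X`, `y`, `patient_ids`: preprocessing and feature selection are fitted inside each CV training fold via a Pipeline (never on the whole dataset), and GroupKFold keeps all visits of a patient in the same fold.

```python
# Leakage-aware evaluation sketch: scaling + feature selection live inside the
# Pipeline (refit on each training fold only), and GroupKFold keeps all samples
# of a patient together. Assumes scikit-learn and hypothetical X, y, patient_ids.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5), groups=patient_ids)
print(scores.mean())
```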