Cross-validation (statistics) Study Guide
📖 Core Concepts
Cross‑validation (CV) – Resampling technique that splits data into training and validation subsets repeatedly to estimate how a model will perform on unseen data.
Training vs. validation performance – Training error is optimistically low; validation error approximates true predictive error.
Bias–variance trade‑off in CV – Small validation sets give low bias but high variance; larger sets reduce variance but can increase bias.
Nested CV – Two‑layer CV (outer for performance estimation, inner for hyper‑parameter tuning) that prevents optimistic bias when model selection is involved.
Blocking (temporal/spatial) – Splitting that respects dependence structures (e.g., rolling origin for time series, spatial blocks for ecological data).
📌 Must Remember
Leave‑p‑out: repeats = $\binom{n}{p}$; infeasible for moderate $p$, large $n$.
Leave‑One‑Out (LOO): $n$ fits; still costly for large $n$.
k‑fold CV: data → $k$ equal folds; each fold serves once as validation. Common $k=10$.
Stratified k‑fold: preserves class proportions (classification) or response distribution (regression).
Repeated k‑fold / Monte‑Carlo CV: multiple random partitions → more stable estimate.
Holdout: single split → high variance; use cautiously.
Nested CV workflow: outer $k$ folds → test set; inner $l$ folds → hyper‑parameter tuning.
Performance metrics:
Classification → misclassification error (from confusion matrix).
Regression → MSE $= \frac{1}{m}\sum_{i=1}^{m} (y_i-\hat y_i)^2$, RMSE $= \sqrt{\text{MSE}}$, MAD (median absolute deviation).
Leakage rule: All preprocessing (feature selection, scaling) must be done inside each training fold.
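The leakage rule can be made concrete with a minimal sketch: scaling parameters are fitted on the training fold only and then applied to both sides of the split (the function name `standardize_fold` is illustrative, not from any particular library):

```python
import statistics

def standardize_fold(train_x, val_x):
    """Fit scaling parameters on the training fold ONLY, then apply to both.

    Computing the mean/std on the full dataset before splitting would leak
    validation information into training and bias the CV estimate optimistically.
    """
    mu = statistics.mean(train_x)
    sd = statistics.pstdev(train_x) or 1.0  # guard against zero variance
    scale = lambda xs: [(x - mu) / sd for x in xs]
    return scale(train_x), scale(val_x)
```

The same pattern applies to feature selection, imputation, and oversampling: fit on the training fold, transform the validation fold.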
🔄 Key Processes
Standard k‑fold CV
Randomly shuffle data.
Partition into $k$ folds of size $\approx n/k$.
For each fold $i$: train on all folds ≠ $i$, validate on fold $i$.
Record performance metric $F_i$.
Compute overall estimate $\displaystyle \hat F = \frac{1}{k}\sum_{i=1}^{k} F_i$.
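The steps above can be sketched in plain Python; `fit` and `metric` are hypothetical callables standing in for any model and any performance measure:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and partition them into k (nearly) equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def k_fold_cv(data, k, fit, metric, seed=0):
    """Average a performance metric over k train/validate splits.

    fit(train)         -> fitted model
    metric(model, val) -> scalar performance F_i on the held-out fold
    """
    folds = k_fold_indices(len(data), k, seed)
    scores = []
    for val_idx in folds:
        val_set = set(val_idx)
        train = [data[j] for j in range(len(data)) if j not in val_set]
        val = [data[j] for j in val_idx]
        scores.append(metric(fit(train), val))
    return sum(scores) / k  # overall estimate: mean of the fold scores
```

For example, with `fit` returning the training-set mean and `metric` computing MSE, the function returns the cross-validated MSE of that trivial model.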
Nested CV (k × l)
Split data into $k$ outer folds.
For each outer fold: treat it as test set, keep remaining $k-1$ folds as outer‑training.
Within outer‑training, perform $l$‑fold inner CV to tune hyper‑parameters.
Choose best hyper‑parameters, refit on whole outer‑training, evaluate on outer test fold.
Aggregate outer‑fold performances for unbiased error estimate.
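The workflow above can be sketched as a two-layer loop. This is a minimal illustration assuming `fit(train, h)` builds a model with hyper-parameter `h` and `metric` returns an error (lower is better); the names are hypothetical:

```python
import random

def folds(n, k, seed):
    """Shuffled partition of indices 0..n-1 into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv(data, k_outer, l_inner, candidates, fit, metric):
    """Outer loop estimates performance; inner loop tunes a hyper-parameter."""
    outer_scores = []
    for test_idx in folds(len(data), k_outer, seed=0):
        test_set = set(test_idx)
        outer_train = [data[j] for j in range(len(data)) if j not in test_set]
        test = [data[j] for j in test_idx]

        # Inner CV: score each candidate using the outer-training data only.
        def inner_score(h):
            errs = []
            for val_idx in folds(len(outer_train), l_inner, seed=1):
                val_set = set(val_idx)
                tr = [outer_train[j] for j in range(len(outer_train)) if j not in val_set]
                va = [outer_train[j] for j in val_idx]
                errs.append(metric(fit(tr, h), va))
            return sum(errs) / len(errs)

        best_h = min(candidates, key=inner_score)
        # Refit on the whole outer-training set, evaluate once on the test fold.
        outer_scores.append(metric(fit(outer_train, best_h), test))
    return sum(outer_scores) / k_outer
```

Note that the test fold never influences hyper-parameter selection, which is exactly what removes the optimistic bias.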
Rolling‑origin (time‑series) CV
Choose initial training window $[1, t]$.
Predict $t+1$ (or horizon $h$).
Expand (or roll) the window forward and repeat the prediction step.
Average forecast errors across all origins.
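A split generator for this procedure might look as follows (a sketch; `initial` and `horizon` are the training-window and forecast-horizon lengths):

```python
def rolling_origin_splits(n, initial, horizon=1, expanding=True):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    expanding=True grows the training window at each origin;
    expanding=False rolls a fixed-length window forward instead.
    """
    t = initial
    while t + horizon <= n:
        start = 0 if expanding else t - initial
        yield list(range(start, t)), list(range(t, t + horizon))
        t += 1  # advance the forecast origin by one step
```

Each test set lies strictly after its training window, so no future information leaks into the fit.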
Spatial blocked CV
Partition geographic area into contiguous blocks.
For each block, train on all other blocks, validate on the held‑out block.
Optionally add a buffer: exclude training observations from blocks that border the held-out block, so spatial autocorrelation across block edges cannot leak.
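A simple way to form contiguous blocks is a regular grid over the coordinates. This is a bare-bones sketch (dedicated tools such as the R package blockCV add buffers, block-size selection, and irregular shapes):

```python
def grid_blocks(points, nx, ny):
    """Assign 2-D points to nx * ny contiguous grid blocks.

    Returns {(bx, by): [point indices]}; in blocked CV, each block is
    held out once while the model trains on all remaining blocks.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    blocks = {}
    for i, (x, y) in enumerate(points):
        bx = min(int((x - x0) / (x1 - x0 + 1e-9) * nx), nx - 1)
        by = min(int((y - y0) / (y1 - y0 + 1e-9) * ny), ny - 1)
        blocks.setdefault((bx, by), []).append(i)
    return blocks
```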
🔍 Key Comparisons
Leave‑One‑Out vs. k‑fold
LOO: $n$ fits, low bias, high variance, computationally heavy.
k‑fold (e.g., $k=10$): $k$ fits, moderate bias, lower variance, faster.
Holdout vs. Repeated Random Sub‑sampling (Monte‑Carlo)
Holdout: single split → high variability.
Monte‑Carlo: many random splits → reduces variability, but may reuse observations unevenly.
Stratified k‑fold vs. plain k‑fold
Stratified: preserves class distribution → better for imbalanced classification.
Plain: may produce folds with skewed class ratios → misleading error.
Temporal blocking vs. Random k‑fold (time series)
Temporal blocking respects order → realistic forecast errors.
Random k‑fold leaks future information → overly optimistic performance.
⚠️ Common Misunderstandings
“Preprocess once, then CV.” – Doing feature selection or scaling on the full data before CV introduces leakage and yields optimistic bias.
“LOO is always best because it uses almost all data.” – For noisy data, LOO’s high variance can mislead; k‑fold often gives a better bias‑variance balance.
“Higher $k$ always improves estimate.” – Beyond $k\approx10$, gains are marginal while computational cost rises.
“Cross‑validation guarantees external validity.” – CV assumes training and validation data come from the same distribution; real‑world shift (e.g., new region, future time) can still break the estimate.
🧠 Mental Models / Intuition
“CV is a rehearsal.” – Imagine training the model repeatedly on different “practice” sets; the average score tells you how it will likely perform on the real “stage.”
“Nested CV is a two‑stage audition.” – First, audition (inner CV) to pick the best costume (hyper‑parameters); second, perform (outer CV) to see the final audience reaction without bias.
“Blocking = respecting the storyline.” – In a story (time series or space), you can’t reveal future chapters or neighboring scenes to the reader before they occur; blocking enforces that order.
🚩 Exceptions & Edge Cases
Very small datasets (< 30 obs). – LOO may be the only feasible exhaustive method, but variance will be huge; consider bootstrapping instead.
Highly imbalanced classification. – Even stratified k‑fold can yield folds with too few minority cases; use stratified repeated CV or combine with SMOTE inside each training fold.
Heavy‑tailed error distributions. – Median absolute deviation may be more robust than MSE; choose metric accordingly.
Computationally expensive models (e.g., deep nets). – Use fewer folds (e.g., $k=5$) or approximate CV with a validation curve.
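The MSE-vs-MAD point is easy to demonstrate: a single large residual dominates MSE but barely moves the median absolute deviation. The residual values below are illustrative, not from any dataset:

```python
import statistics

def mse(errors):
    """Mean squared error of a list of residuals."""
    return sum(e * e for e in errors) / len(errors)

def mad(errors):
    """Median absolute deviation of the residuals (robust to outliers)."""
    return statistics.median(abs(e) for e in errors)

# Illustrative residuals: mostly small, plus one gross outlier.
clean = [0.5, -0.4, 0.3, -0.2, 0.1]
with_outlier = clean + [20.0]
```

Here `mse(with_outlier)` is hundreds of times larger than `mse(clean)`, while `mad` shifts only from 0.3 to 0.35, which is why MAD is preferred under heavy-tailed errors.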
📍 When to Use Which
| Situation | Recommended CV technique |
|-----------|--------------------------|
| General predictive performance estimate, moderate $n$ | k‑fold (k≈5‑10), optionally stratified |
| Hyper‑parameter tuning and performance estimate | Nested k × l CV |
| Very small $n$ (≤ 30) | Leave‑One‑Out or Leave‑p‑Out (p small) |
| Imbalanced classification | Stratified k‑fold (repeat 3‑5 times) |
| Time‑series forecasting | Rolling / forward‑chaining CV (blocked) |
| Spatially autocorrelated data (ecology, SDM) | Spatial block CV (contiguous blocks, no border leakage) |
| Need quick, cheap estimate | Holdout (only if variance not critical) |
| Want stable estimate without full CV cost | Repeated random sub‑sampling (Monte‑Carlo) |
👀 Patterns to Recognize
Performance improves as validation set size shrinks → likely over‑fitting (look for LOO “too good”).
Large variance across folds → data heterogeneity; consider stratification or blocking.
Consistent error drop after hyper‑parameter tuning only in inner folds → possible leakage; check nested CV.
Sudden spikes in error for specific folds → outlier or data‑drift; inspect that fold’s characteristics.
🗂️ Exam Traps
Choosing “the best $k$ is always 10.” – Exam may ask why $k$ can be 5 or 20; answer: trade‑off, data size, computational budget.
Confusing LOO with leave‑p‑out (p = 1). – LOO is a special case; leave‑p‑out repeats $\binom{n}{p}$ times, which becomes infeasible quickly.
Assuming random k‑fold is fine for time series. – This breaks temporal dependence → overestimates accuracy.
Reporting only the mean CV error without variance. – Many questions expect you to mention the estimate’s variance (Bengio & Grandvalet 2004).
Selecting features on the whole dataset before CV. – Leads to optimistic bias; the correct step is to embed feature selection inside each training fold.
---
Use this guide as a rapid “cheat‑sheet” before your exam – focus on the key decision rules and the “When to Use Which” table to pick the right validation strategy quickly.