Model Selection Study Guide
📖 Core Concepts
Model selection – choosing the best model (or algorithm, features, hyper‑parameters) from a set of candidates using a performance criterion.
Inference vs. Prediction – Inference seeks a model that reveals the true data‑generating mechanism; Prediction seeks the model that forecasts new data most accurately, even if it’s less interpretable.
Occam’s razor – when two models perform similarly, prefer the simpler (fewer parameters) one.
Selection consistency – a procedure that picks the true model with probability → 1 as sample size $n \to \infty$.
Bias‑variance trade‑off – simple models → high bias, low variance; complex models → low bias, high variance. Good selection balances the two.
---
📌 Must Remember
AIC: $ \text{AIC} = -2\log L + 2k $, where $k$ is the number of estimated parameters.
BIC: $ \text{BIC} = -2\log L + k\log n $ (stronger penalty when $n$ is large).
Mallows’s $C_p$: $ C_p = \frac{\text{RSS}_p}{\hat{\sigma}^2} - (n - 2p) $; choose models with $C_p \approx p$.
PRESS: $\text{PRESS}= \sum_{i=1}^{n}\bigl(y_i-\hat{y}_{(i)}\bigr)^2$, where $\hat{y}_{(i)}$ is the prediction for observation $i$ from the model fitted without it (LOO‑CV prediction error).
Bayes factor: $BF_{12}= \dfrac{p(\text{data}\mid M_1)}{p(\text{data}\mid M_2)}$; $BF_{12}>1$ favors $M_1$.
Cross‑validation – most reliable estimate of out‑of‑sample performance; $k$‑fold CV repeats training/validation $k$ times.
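The AIC and BIC formulas above can be made concrete with a small numerical sketch. This uses synthetic data and an OLS fit (the data, seed, and helper name are illustrative, not from the guide); the Gaussian log‑likelihood is evaluated at the MLE, and $k$ counts the regression coefficients plus the error variance.

```python
import numpy as np

def aic_bic(log_likelihood, k, n):
    """AIC = -2*logL + 2k; BIC = -2*logL + k*log(n)."""
    aic = -2 * log_likelihood + 2 * k
    bic = -2 * log_likelihood + k * np.log(n)
    return aic, bic

# Hypothetical data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])       # design matrix: intercept + slope
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / n                 # MLE of the error variance
logL = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

aic, bic = aic_bic(logL, k=3, n=n)         # k = 2 coefficients + 1 variance
```

Note that with $n = 100$, $\log n \approx 4.6 > 2$, so BIC's penalty exceeds AIC's for every model, illustrating the "stronger penalty when $n$ is large" point above.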
---
🔄 Key Processes
Define candidate set
Specify functional form, variables, and assumptions for each model.
Apply data transformations (log, scaling) if needed.
Fit all candidates (e.g., via maximum likelihood).
Compute selection metric (AIC, BIC, CV error, etc.) for each model.
Rank models by metric; apply Occam’s razor when scores are close.
Validate chosen model using an independent resampling method (e.g., $k$‑fold CV or PRESS).
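The steps above can be sketched end‑to‑end: define a candidate set (here, hypothetical polynomial fits of increasing degree on synthetic data), fit each by least squares, score each with AIC, and rank. All names and data here are illustrative assumptions, not part of the guide.

```python
import numpy as np

# Hypothetical data: a quadratic signal plus noise.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)

scores = {}
for degree in range(1, 6):                  # candidate set: degrees 1..5
    X = np.vander(x, degree + 1)            # design matrix for this candidate
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n              # MLE of the error variance
    logL = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2                          # coefficients + variance
    scores[degree] = -2 * logL + 2 * k      # AIC for this candidate

best = min(scores, key=scores.get)          # rank models by metric
```

In practice the last step, validating `best` on held‑out data (e.g., $k$‑fold CV), should follow before committing to the winner.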
---
🔍 Key Comparisons
AIC vs. BIC
AIC: $+2k$ penalty → favors predictive accuracy, tolerates larger models.
BIC: $+k\log n$ penalty → stronger penalty for complexity, tends toward true model (selection consistency).
Cross‑validation vs. PRESS
CV: flexible (any $k$, any model class); more computationally intensive; yields one error estimate per fold, so you can also gauge the variability of the estimate.
PRESS: equivalent to leave‑one‑out CV for linear models; cheap to compute (a single fit, via the leverages) but limited to linear settings.
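The "cheap to compute" claim for PRESS rests on a standard OLS identity: the leave‑one‑out residual equals $e_i/(1 - h_{ii})$, where $h_{ii}$ is the $i$‑th diagonal of the hat matrix, so no refitting is needed. A minimal sketch on synthetic data (variable names are assumptions), with a brute‑force LOO refit as a cross‑check:

```python
import numpy as np

# Hypothetical linear data.
rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
resid = y - H @ y                            # ordinary residuals
press = np.sum((resid / (1 - np.diag(H))) ** 2)   # PRESS via leverages

# Cross-check: actually leave each point out and refit.
loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    loo += (y[i] - X[i] @ b) ** 2
```

The two quantities agree exactly for OLS, which is why PRESS is a free by‑product of a single fit.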
Feature selection vs. Hyperparameter optimization
Feature selection reduces input dimensionality (which variables are used).
Hyperparameter optimization tunes algorithm settings (e.g., regularization strength).
---
⚠️ Common Misunderstandings
“Lower AIC always means the best model.” – AIC is relative; only compare among the same candidate set.
“Cross‑validation guarantees the globally best model.” – CV estimates performance on the data distribution, but poor candidate sets or data leakage still lead to bad choices.
“Complex models are always better for prediction.” – Over‑fitting can inflate apparent performance; bias‑variance balance is crucial.
“A Bayes factor > 1 is always strong evidence.” – Evidence strength depends on magnitude (e.g., $BF = 2$ is weak); common conventions place “strong” at roughly $BF > 10$.
---
🧠 Mental Models / Intuition
Penalty = “price of flexibility.” Think of each extra parameter as a “tax” that must be justified by a substantial drop in lack‑of‑fit.
Bias‑variance seesaw: moving toward lower bias (more complexity) automatically raises variance; the optimal point is where total error is minimal.
Occam’s razor as “budget constraint”: you have limited “model budget” (complexity); allocate it only where it buys a noticeable performance gain.
---
🚩 Exceptions & Edge Cases
Small sample size ($n$ low): BIC’s $\log n$ penalty may be too harsh; AIC or CV often preferred.
Non‑nested models: Likelihood‑ratio tests are invalid; rely on information criteria or CV.
Highly correlated predictors: Mallows’s $C_p$ can mislead; consider penalized regression (ridge, LASSO) instead.
Non‑linear or non‑Gaussian data: Standard AIC/BIC formulas still apply if the likelihood is correctly specified; otherwise use CV.
---
📍 When to Use Which
AIC – when goal is prediction and you have a moderate‑to‑large $n$; models are not required to be true.
BIC – when you aim for model identification (inference) and sample size is sizable; favors selection consistency.
$k$‑fold CV – whenever you can afford the computation and need a robust estimate of out‑of‑sample error.
PRESS – quick check for linear regression models; good for rapid screening.
Mallows’s $C_p$ – when you have nested linear models and want an unbiased estimate of prediction error.
Bayes factor – in a Bayesian framework or when prior information is essential.
---
👀 Patterns to Recognize
“Score differences < 2” (AIC/BIC) → models are statistically indistinguishable; invoke simplicity.
CV error curve that flattens after a certain complexity → adding more parameters yields diminishing returns.
BIC selects a noticeably smaller model than AIC → the extra parameters AIC tolerates are only weakly supported at this sample size. (Note that BIC *scores* are higher than AIC scores for every model once $n > e^2 \approx 7.4$, so that alone signals nothing.)
Mallows’s $C_p \approx p$ across several models → those models likely have good bias‑variance balance.
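The "score differences < 2" pattern is often formalized via $\Delta\text{AIC}$ values and Akaike weights (a common convention, not something the guide prescribes). A sketch with hypothetical AIC scores:

```python
import numpy as np

# Hypothetical AIC scores for four candidate models.
aic = np.array([102.1, 103.4, 108.9, 115.0])

delta = aic - aic.min()            # ΔAIC relative to the best model
weights = np.exp(-0.5 * delta)
weights /= weights.sum()           # Akaike weights: relative support, sum to 1

close = delta < 2                  # "statistically indistinguishable" set
```

Here the first two models fall inside the $\Delta < 2$ band, so Occam’s razor would break the tie in favor of the simpler one.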
---
🗂️ Exam Traps
Choosing the model with the absolute lowest AIC without checking whether the difference is meaningful.
Assuming a likelihood‑ratio test works for non‑nested models – it does not; answer choices that cite it are distractors.
Confusing “penalty = $2k$” with “penalty = $k\log n$” – mix‑ups between AIC and BIC are common.
Selecting a model based solely on the smallest PRESS when the underlying assumptions (linearity, homoscedasticity) are violated.
Interpreting a Bayes factor of 1.5 as strong evidence – it is only anecdotal; look for thresholds (e.g., > 10).