Introduction to Model Selection
Understand the trade‑off between fit and complexity, key tools such as cross‑validation and information criteria, and a systematic workflow for selecting and validating models.
Summary
Model Selection: Choosing the Right Model
Introduction
Model selection is one of the most important tasks in statistics and machine learning. At its core, it answers a fundamental question: how do we choose a model that works well on new data?
The process seems straightforward—fit several models and pick the best one—but there's a hidden danger. A model that looks excellent on the data you used to build it might perform terribly when you apply it to new data. Model selection is about finding the sweet spot: a model complex enough to capture real patterns, but not so complex that it's memorizing noise.
This tension between fitting the data well and keeping the model simple is what makes model selection challenging and essential.
The Core Problem: Underfitting and Overfitting
Underfitting
Underfitting happens when your model is too simple. It cannot capture the true patterns in the data, so both training and testing performance are poor. Imagine trying to fit a straight line to data that actually follows a curve—no matter how much training data you have, your predictions will be systematically wrong.
The key symptom of underfitting is high bias: your model consistently misses the mark because it's making overly rigid assumptions about how the world works.
Overfitting
Overfitting is the opposite problem. Your model is so flexible and complex that it doesn't just learn the true patterns—it also learns the noise and random variation specific to your training data. The result: your model performs brilliantly on training data but poorly on new data.
The key symptom of overfitting is high variance: small changes in the training data lead to wildly different models because the model is fitting random details rather than genuine patterns.
The Trade-off
The goal of model selection is to navigate between these two extremes. You want a model that:
Is complex enough to capture real signals in the data
Is simple enough to ignore noise and avoid fitting random details
This is sometimes called the bias-variance trade-off. More complex models reduce bias (fit the signal better) but increase variance (become more sensitive to noise). Model selection tools help you find where this trade-off is most favorable.
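The trade-off described above can be seen numerically. The sketch below fits polynomials of increasing degree to synthetic data (a hypothetical quadratic signal plus noise, not from the source) and compares training error against error on a fresh draw from the same process: the too-simple model does poorly everywhere, while the too-complex one looks great on training data but worse on new data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic signal plus noise
x = np.linspace(0, 1, 40)
true_f = lambda t: 1.0 + 2.0 * t - 3.0 * t**2
y = true_f(x) + rng.normal(0, 0.1, size=x.size)
y_new = true_f(x) + rng.normal(0, 0.1, size=x.size)  # fresh draw = "new data"

results = {}
for degree in (1, 2, 12):
    coefs = np.polyfit(x, y, degree)  # fit a degree-d polynomial by least squares
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x) - y_new) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Degree 1 underfits (high error on both sets), degree 2 matches the signal, and degree 12 drives training error down by chasing noise.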
Tools for Evaluating and Selecting Models
Cross-Validation: The Gold Standard
Cross-validation is one of the most reliable and widely used tools for model selection. Here's how it works:
Divide your data into k equal-sized folds (often k = 5 or 10)
For each fold: train your model on the other k-1 folds, then evaluate it on the held-out fold
Average the results across all k folds to get an overall estimate of predictive performance
The genius of cross-validation is that it simulates what will happen with new data. Since you're always testing on data the model hasn't seen during training, cross-validation gives an honest estimate of how well your model will generalize.
Why use cross-validation?
Works with almost any type of model
Directly measures predictive performance, which is often your actual goal
More stable than training/test split on small datasets
The main drawback is that it's computationally expensive—you fit your model k times instead of once.
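The three steps above can be sketched from scratch. This is a minimal k-fold CV loop for ordinary least squares on synthetic data (the data-generating process and function name are illustrative assumptions, not from the source):

```python
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """Estimate out-of-sample MSE of an OLS fit via k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle before splitting into folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit OLS on the other k-1 folds (column of ones adds an intercept)
        A = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        errors.append(np.mean((A_test @ beta - y[test_idx]) ** 2))
    return float(np.mean(errors))          # average over the k held-out folds

# Hypothetical example: noisy linear data
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=100)
y = 2.0 + 3.0 * X + rng.normal(0, 0.5, size=100)
cv_mse = kfold_cv_mse(X, y, k=5)
print(f"5-fold CV estimate of MSE: {cv_mse:.3f}")
```

Because each observation is held out exactly once, the averaged error approximates the noise variance rather than the (optimistic) training error.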
Information Criteria: Fast Alternatives
Information criteria provide a faster, mathematically elegant alternative to cross-validation. The two most common, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both combine two components:
$$\text{Information Criterion} = -2 \times \text{log-likelihood} + \text{penalty}$$
The log-likelihood measures how well your model fits the data (higher is better). The penalty grows with the number of estimated parameters $k$: AIC uses $2k$, while BIC uses $k \ln n$, a harsher penalty whenever $n \geq 8$. Together, they balance fit against complexity.
A lower information criterion value indicates a better model. AIC tends to select slightly more complex models than BIC, so the choice between them depends on your priorities.
When to use information criteria:
When you need speed (they compute in one pass)
When your models are fit by maximum likelihood (linear regression, logistic regression, etc.)
When you're comparing models with clear theoretical differences
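For a Gaussian linear model, both criteria can be computed directly from the residual sum of squares. The sketch below compares a linear and a quadratic fit on synthetic data where the truth is linear (the data and parameter counts are illustrative assumptions):

```python
import numpy as np

def aic_bic(rss, n, k):
    """AIC and BIC for a Gaussian linear model with RSS, n observations,
    and k estimated parameters (coefficients plus the noise variance)."""
    # Maximized Gaussian log-likelihood expressed in terms of RSS
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + np.log(n) * k
    return aic, bic

# Hypothetical example: the truth is linear, the quadratic term is noise-fitting
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=50)

scores = {}
for degree in (1, 2):
    coefs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coefs, x) - y) ** 2)
    k = degree + 2                       # polynomial coefficients + noise variance
    scores[degree] = aic_bic(rss, len(y), k)
    print(f"degree {degree}: AIC {scores[degree][0]:.1f}, BIC {scores[degree][1]:.1f}")
```

Note how BIC's per-parameter penalty (here $\ln 50 \approx 3.9$ versus AIC's 2) makes it relatively more reluctant to accept the extra quadratic term.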
Adjusted R-squared
For linear regression, adjusted R-squared is a simple diagnostic tool. Unlike ordinary R-squared, which always improves when you add predictors, adjusted R-squared penalizes additional predictors:
$$R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$$
where $n$ is the number of observations and $p$ is the number of predictors.
Use adjusted R-squared when:
Working with linear regression
You want a quick indication of whether adding more predictors actually helps
You need a metric that's easy to explain to non-technical audiences
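The formula above is a one-liner in code. The worked numbers below are a hypothetical illustration of the key behavior: a fourth predictor that nudges $R^2$ up only slightly can still lower the adjusted value, signalling it isn't worth keeping.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes each additional predictor."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: adding a fourth predictor raises R-squared
# from 0.750 to 0.752, but the adjusted value falls.
print(round(adjusted_r2(0.750, n=100, p=3), 4))
print(round(adjusted_r2(0.752, n=100, p=4), 4))
```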
Regularization: Shrinking Your Way to Simplicity
Regularization takes a different approach to model selection: instead of comparing many models, you fit one model with a tuning parameter that controls complexity.
LASSO (Least Absolute Shrinkage and Selection Operator)
LASSO adds a penalty based on the sum of absolute coefficient values. As you increase the penalty, some coefficients shrink all the way to exactly zero—LASSO automatically performs variable selection. This is especially useful when you have many potential predictors and suspect that only some of them matter.
Ridge Regression
Ridge adds a penalty based on the sum of squared coefficients. Unlike LASSO, coefficients shrink toward zero but don't reach it exactly. Ridge is particularly useful when you have multicollinearity (predictors that are highly correlated with each other), as it stabilizes the estimates.
How to use regularization:
Fit models across a range of penalty values
Use cross-validation to find the penalty that minimizes validation error
Choose the model at that optimal penalty
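Ridge has a closed-form solution, which makes the shrinkage easy to see. The sketch below (synthetic data; the function name and setup are illustrative assumptions) fits ridge across a range of penalty values and shows the coefficient norm shrinking toward zero as the penalty grows. LASSO has no closed form and needs an iterative solver, so it is not shown here.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^(-1) X'y.
    Assumes predictors are standardized and y is centered (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical data: only two of five predictors actually matter
rng = np.random.default_rng(3)
n, p = 80, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each predictor
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(0, 1.0, size=n)
y = y - y.mean()                           # center the response

norms = {}
for lam in (0.0, 1.0, 100.0):
    beta = ridge_fit(X, y, lam)
    norms[lam] = float(np.linalg.norm(beta))
    print(f"lambda = {lam:6.1f}  ||beta|| = {norms[lam]:.3f}")
```

In practice you would evaluate each penalty value with cross-validation, as described above, rather than inspecting the norms directly.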
The Model Selection Workflow
Here's a practical step-by-step process for selecting a model:
Step 1: Specify Candidate Models
Start by defining a set of plausible models. These might differ in:
Which predictors you include
The functional form (linear, polynomial, interaction terms)
The algorithm family (regression, tree, neural network)
Your candidate set should be informed by domain knowledge and the problem context. Don't just try every possible combination—that will almost certainly lead to overfitting.
Step 2: Fit Models to Training Data
Train each candidate model on your training data using the appropriate estimation method. For a given model structure, use standard fitting procedures like ordinary least squares, maximum likelihood, or gradient descent.
Step 3: Evaluate Models
Evaluate each model using one or more of the tools discussed above:
Cross-validation for predictive performance
AIC or BIC for a fast balance between fit and complexity
Adjusted R² if you're working with linear models
Regularization paths if you're using LASSO or Ridge
Step 4: Choose the Best Model
Select the model with the best trade-off. This usually means:
The lowest cross-validated error
The lowest information criterion value
The highest adjusted R-squared
When multiple tools disagree slightly, don't panic—this is normal. Look at which models appear best across multiple criteria.
Step 5: Validate on a Test Set
Once you've selected your model, validate it on a separate test set that was never used during training or model selection. This provides an unbiased estimate of how well it will perform on truly new data. This step is crucial because you've already done many comparisons and decisions, which could lead to subtle overfitting to your training and validation data.
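The five steps can be strung together in a short script. This is a minimal sketch on synthetic data (the sine signal, candidate degrees, and helper name are illustrative assumptions): hold out a test set first, select a polynomial degree by cross-validation on the training portion only, then touch the test set exactly once.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 120)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=120)

# Hold out a test set BEFORE any model selection (Step 5 depends on this)
idx = rng.permutation(120)
train, test = idx[:90], idx[90:]

def cv_mse_poly(xs, ys, degree, k=5):
    """5-fold CV error of a degree-d polynomial fit."""
    folds = np.array_split(np.arange(len(ys)), k)
    errs = []
    for i in range(k):
        hold = folds[i]
        keep = np.concatenate([folds[j] for j in range(k) if j != i])
        c = np.polyfit(xs[keep], ys[keep], degree)
        errs.append(np.mean((np.polyval(c, xs[hold]) - ys[hold]) ** 2))
    return float(np.mean(errs))

# Steps 1-4: compare candidate degrees by CV on the training set only
cv_scores = {d: cv_mse_poly(x[train], y[train], d) for d in range(1, 8)}
best_degree = min(cv_scores, key=cv_scores.get)

# Step 5: a single, final evaluation on the untouched test set
final = np.polyfit(x[train], y[train], best_degree)
test_mse = np.mean((np.polyval(final, x[test]) - y[test]) ** 2)
print(f"chosen degree: {best_degree}, test MSE: {test_mse:.3f}")
```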
Key Principles for Good Model Selection
Occam's Razor in Statistics
The principle of Occam's Razor states: choose the simplest model that adequately explains the data. In practical terms:
If two models have similar performance, choose the simpler one
Don't add predictors unless they meaningfully improve predictions
Be skeptical of overly complex models unless they're clearly justified
This principle isn't just philosophical—it's practical. Simpler models are easier to understand, communicate, and deploy. They're also less likely to have been fit to noise.
Predictive Versus Explanatory Goals
Your model's purpose affects which tools you should prioritize:
For prediction: emphasize cross-validation, which directly measures out-of-sample accuracy
For explanation: consider information criteria, which help assess model plausibility and aid in inference
For both: use multiple tools to gain confidence
The Importance of a Test Set
A test set is sacred. Use it only at the very end, after you've made all model selection decisions. Here's why:
The moment you use test data to make decisions, it's no longer a true test set—you're implicitly fitting to it
A test set measures realistic performance on truly unseen data
It protects you from subtle overfitting that can happen even with good practices
<extrainfo>
Avoiding Over-Reliance on a Single Tool
Each model selection tool has strengths and weaknesses. Cross-validation requires more computation but directly measures what you care about. Information criteria are fast but make specific statistical assumptions. Adjusted R² is simple but only works for linear models.
Best practice: use multiple tools. If different approaches agree, you have confidence in your choice. If they disagree, investigate why—it often reveals important insights about your models and data.
</extrainfo>
Flashcards
What is the primary goal of model selection in statistics or machine learning?
To choose a model that best balances accurate description of data (fit) with simplicity.
What specifically does model selection aim to achieve regarding new data?
Predicting well on new, unseen data while avoiding unnecessary complexity.
What is the "sweet spot" that model selection aims to find?
The point where a model captures the true signal without being overly sensitive to noise.
When does underfitting occur in a model?
When a model is too simple to capture important patterns in the data.
What are the typical consequences of underfitting on model predictions?
Biased and inaccurate predictions.
When does overfitting occur in a model?
When a model is too complex and fits random noise in the training data.
How does an overfit model typically perform on training data versus new data?
Excellent performance on training data but poor performance on new data.
What two components do information criteria (like AIC and BIC) combine?
Goodness-of-fit (likelihood) and a penalty for the number of parameters.
Does a higher or lower value of an information criterion indicate a better model?
Lower value.
For which types of models are information criteria particularly fast and useful?
Models fitted by maximum likelihood (e.g., linear or logistic regression).
How does Adjusted R-squared differ from ordinary R-squared?
It penalizes the addition of extra predictors.
What is the primary diagnostic purpose of Adjusted R-squared in linear models?
Identifying when additional predictors no longer improve the model.
How does regularization reduce model complexity?
By imposing a penalty on the magnitude of coefficients to shrink them toward zero.
What unique function does LASSO perform during the fitting process?
Simultaneous variable selection and fitting.
What type of penalty does Ridge regression use, and does it set coefficients to zero?
L2 penalty; it does not set coefficients exactly to zero.
What specific data issue does Ridge regression help stabilize estimates for?
Multicollinearity.
What are the standard steps in a model selection workflow?
Specify candidate models
Fit models to training data
Evaluate models
Choose the best model
Validate the final choice
How is the principle of Occam’s razor applied in statistical model selection?
By choosing the simplest model that adequately fits the data over more complex ones.
Which evaluation tool is prioritized for predictive tasks to assess out-of-sample performance?
Cross-validation.
Which evaluation tool is often prioritized for explanatory models to assess plausibility?
Information criteria.
Why is it important to use an independent test set after the selection process?
To provide an unbiased assessment of the model's generalization ability.
Why is combining multiple evaluation tools recommended for model selection?
To guard against the weaknesses of any single method and ensure robust decisions.
Quiz
Introduction to Model Selection Quiz Question 1: In the context of information criteria (AIC, BIC), what does a lower numeric value indicate?
- A better trade‑off between model fit and complexity (correct)
- A higher likelihood but more overfitting
- A poorer predictive performance on new data
- A model with more parameters than necessary
Introduction to Model Selection Quiz Question 2: Which indicator is typically used to choose the best model after evaluation?
- The model with the lowest cross‑validated error (correct)
- The model with the highest training accuracy
- The most complex model among candidates
- The model with the greatest number of predictors
Introduction to Model Selection Quiz Question 3: According to the principle of Occam’s razor in statistics, what should be preferred?
- The simplest model that adequately fits the data (correct)
- The model with the most predictors regardless of fit
- The most complex model to capture every nuance
- The model that overfits the training data for maximum accuracy
Key Concepts
Model Evaluation Techniques
Model selection
Cross‑validation
Akaike information criterion (AIC)
Bayesian information criterion (BIC)
Adjusted R‑squared
Model Complexity Issues
Underfitting
Overfitting
Occam’s razor (statistical principle)
Regularization Methods
Least absolute shrinkage and selection operator (Lasso)
Ridge regression
Definitions
Model selection
The process of choosing a statistical or machine‑learning model that best balances predictive accuracy with simplicity.
Underfitting
A modeling error where the model is too simple to capture the underlying patterns in the data, leading to biased predictions.
Overfitting
A modeling error where the model is overly complex, fitting noise in the training data and performing poorly on new data.
Cross‑validation
A resampling technique that partitions data into training and validation sets multiple times to estimate a model’s out‑of‑sample performance.
Akaike information criterion (AIC)
An information‑theoretic metric that evaluates model fit while penalizing the number of estimated parameters.
Bayesian information criterion (BIC)
A model‑selection criterion similar to AIC but with a stronger penalty for model complexity, favoring simpler models as sample size grows.
Adjusted R‑squared
A version of the coefficient of determination that adjusts for the number of predictors, discouraging the addition of irrelevant variables.
Least absolute shrinkage and selection operator (Lasso)
A regularization method that adds an L1 penalty to regression coefficients, performing variable selection by shrinking some coefficients to zero.
Ridge regression
A regularization technique that adds an L2 penalty to regression coefficients, reducing their magnitude to mitigate multicollinearity without eliminating variables.
Occam’s razor (statistical principle)
The guideline that, among competing models with similar predictive ability, the simplest one should be preferred.