Model selection - Advanced Methods and Evaluation
Understand advanced model selection methods, key information criteria, and validation/resampling techniques for evaluating models.
Summary
Model Selection: From Candidate Models to Best Fit
Introduction
When building a predictive or explanatory model, you face a fundamental challenge: how do you decide which model is best? Simply choosing the model that fits your data most closely is tempting, but it often leads to overfitting—a model that memorizes noise in your training data and performs poorly on new data.
Model selection is the process of choosing among competing candidate models by balancing two competing goals: (1) fitting the data well, and (2) keeping the model simple. This guide walks you through the methods statisticians and machine learning practitioners use to make this choice rigorously.
The workflow typically follows this pattern: first, define a set of candidate models; second, evaluate them using appropriate criteria; third, validate your choice on unseen data.
Defining Your Candidate Models
Before you can select the best model, you need a reasonable set of candidates to choose from. Three interconnected practices help you construct this set.
Data Transformation
Data transformation involves applying mathematical operations to your raw variables to make them more suitable for modeling. The goal is to make relationships clearer and meet the assumptions of your chosen modeling technique.
Common transformations include:
Logarithmic transformation: Converting a variable $x$ to $\log(x)$ is useful when data spans multiple orders of magnitude or when the relationship between variables is multiplicative rather than additive. For example, income distributions are often right-skewed, and $\log(\text{income})$ creates a more symmetric distribution.
Scaling and standardization: Subtracting the mean and dividing by standard deviation converts a variable to zero mean and unit variance. This is especially important when predictors have very different units or magnitudes, as it can improve numerical stability in algorithms and make coefficients comparable.
Polynomial transformations: Creating variables like $x^2$ or $x^3$ allows your linear model to capture nonlinear relationships.
The motivation for transformation is practical: transformed variables often reveal patterns that raw variables hide, and they help satisfy assumptions about the distribution of errors.
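These transformations are simple to apply in code. A minimal sketch using only the Python standard library (the function names and the toy income figures are my own, for illustration):

```python
import math
import statistics

def log_transform(values):
    """Apply a natural-log transform; values must be positive."""
    return [math.log(v) for v in values]

def standardize(values):
    """Rescale to zero mean and unit (sample) variance, i.e. z-scores."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# A right-skewed income sample becomes far more symmetric after a log transform.
incomes = [20_000, 35_000, 50_000, 80_000, 500_000]
log_incomes = log_transform(incomes)

# Standardized predictors have mean 0 and standard deviation 1.
z = standardize(incomes)
```

In practice you would apply the same fitted transformation (the training-set mean and standard deviation) to any new data, rather than re-standardizing each dataset independently.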
Exploratory Data Analysis
Exploratory data analysis (EDA) is the systematic examination of your data before formal modeling. It involves computing summary statistics, creating visualizations, and investigating relationships between variables.
Through EDA, you might discover that:
A variable has a skewed or multimodal distribution, suggesting a transformation
The relationship between two variables is nonlinear, suggesting polynomial terms or interaction terms
Certain groups in your data behave differently, suggesting stratified models
Important variables are missing or need to be engineered from raw data
For instance, if you plot the relationship between advertising spending and sales revenue and observe a curved pattern rather than a straight line, EDA suggests you should consider a quadratic model.
Model Specification
Model specification means explicitly defining the functional form, variables, and assumptions of each candidate model before you fit it to data. This step prevents p-hacking or data-driven model construction, where you tweak the model based on what fits best (which typically leads to overfitting).
A properly specified model includes:
The equation or formula (e.g., linear, logarithmic, polynomial)
Which variables are included
The assumed error distribution
Any constraints or assumptions
For example, rather than trying dozens of combinations and picking the best, you might specify in advance: "I will fit three candidate models: (1) a simple linear regression, (2) a linear model with quadratic terms, and (3) a linear model with interaction terms."
Information Criteria: Balancing Fit and Complexity
Once you've defined candidate models, the central question becomes: how do you compare them fairly? A model with more parameters always fits the training data better, but this doesn't mean it's the best choice.
Information criteria solve this by penalizing model complexity. Each criterion adds a "penalty term" to the negative log-likelihood to discourage overfitting. The best model minimizes the information criterion—it balances fitting the data well against unnecessary complexity.
Akaike Information Criterion (AIC)
The Akaike Information Criterion is defined as:
$$\text{AIC} = 2k + (-2 \log L)$$
where $k$ is the number of parameters and $\log L$ is the log-likelihood (a measure of how well the model fits the data; higher is better).
Breaking this down:
The $(-2 \log L)$ term measures fit; smaller values indicate better fit to the data
The $2k$ term is the penalty for complexity; it increases linearly with the number of parameters
The motivation for AIC comes from information theory: Akaike showed that minimizing AIC approximately minimizes the expected prediction error on new data. In other words, the model with the lowest AIC tends to generalize best.
Key intuition: AIC says "I'll reward you for fitting the data well, but I'll penalize you 2 points for each additional parameter you add."
When to use AIC: AIC is preferred when your goal is prediction; compared with BIC, it tends to select slightly more complex models.
Bayesian Information Criterion (BIC)
The Bayesian Information Criterion is defined as:
$$\text{BIC} = k \log(n) + (-2 \log L)$$
where $n$ is the sample size.
The difference from AIC is in the penalty term: BIC uses $k \log(n)$ instead of $2k$.
Why the difference matters: whenever $n \ge 8$, $\log(n) > 2$, so BIC penalizes complexity more heavily than AIC. This means:
AIC tends to select more complex models
BIC tends to select simpler models
The motivation for BIC comes from Bayesian statistics and model selection theory. BIC has strong theoretical properties for choosing the "true" underlying model when it exists in your candidate set.
When to use BIC: BIC is preferred when your goal is explanation or identifying the data-generating model, or when you believe the true model is relatively simple.
A concrete example: Suppose you're comparing two models for predicting house prices. Model A (simple) has 3 parameters and $\log L = -500$. Model B (complex) has 10 parameters and $\log L = -490$. With $n = 100$:
$\text{AIC}_A = 2(3) - 2(-500) = 6 + 1000 = 1006$
$\text{AIC}_B = 2(10) - 2(-490) = 20 + 980 = 1000$
AIC selects Model B
$\text{BIC}_A = 3 \log(100) - 2(-500) = 13.8 + 1000 = 1013.8$
$\text{BIC}_B = 10 \log(100) - 2(-490) = 46.1 + 980 = 1026.1$
BIC selects Model A
Notice BIC's larger penalty ($\log(100) \approx 4.6$ per parameter) makes it more conservative: the 10-point gain in log-likelihood is worth 7 extra parameters under AIC but not under BIC.
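Both criteria are one-line computations once the log-likelihood is known. A minimal sketch (the helper names are my own):

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k log(n) - 2 log L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood

# Model A: 3 parameters, log L = -500, n = 100.
aic_a = aic(-500, 3)        # 1006.0
bic_a = bic(-500, 3, 100)   # ≈ 1013.8
```

Because the criteria share the $-2 \log L$ term, any disagreement between them comes entirely from the penalty terms, $2k$ versus $k \log(n)$.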
<extrainfo>
Minimum Description Length
The Minimum Description Length (MDL) principle takes a coding-theory perspective: the best model is the one that provides the shortest total encoding of both the data and the model itself.
If you had to send someone both the fitted model and the data (encoded using that model) in as few bits as possible, MDL says you should choose whichever model makes the total message shortest. This principle is elegant but less commonly used in practice because calculating the optimal encoding is computationally challenging.
Structural Risk Minimization
Structural Risk Minimization (SRM) comes from statistical learning theory and balances empirical error (how well the model fits the training data) against a complexity term derived from the capacity of your hypothesis space.
Rather than penalizing the number of parameters, SRM penalizes the "richness" of the set of functions your model can represent. Models that can express a wider variety of functions face larger penalties. This is more sophisticated than AIC or BIC but requires more theoretical machinery to implement.
</extrainfo>
Validating Your Model: Resampling Approaches
Information criteria estimate how well a model will generalize, but validation directly measures it by evaluating performance on data the model hasn't seen. This is more reliable but more computationally expensive.
Cross-Validation
Cross-validation is the gold standard for estimating a model's predictive performance. The basic idea is deceptively simple: repeatedly divide your data into training and validation sets, fit the model on the training set, and evaluate it on the validation set. Average the results across all splits.
The most common form is k-fold cross-validation:
Randomly divide your data into $k$ roughly equal-sized folds
For $i = 1$ to $k$:
Treat fold $i$ as the validation set
Train the model on the other $k-1$ folds
Compute the prediction error on fold $i$
Average the $k$ prediction errors to get the cross-validation estimate
If $k = n$ (where $n$ is sample size), this is called leave-one-out cross-validation (LOOCV): leave out one observation, fit the model, predict that observation, and repeat for all observations.
Why cross-validation matters:
It directly estimates generalization error on new data
It's more realistic than just checking fit to training data
It accounts for variation across different data splits
Tradeoff: Cross-validation is computationally expensive (you fit the model $k$ times). With small $k$ (say 5), it's faster but slightly less accurate; with large $k$, it's more accurate but slower. $k = 5$ or $k = 10$ is typical.
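The k-fold procedure above can be sketched in a few lines. This version uses only the standard library and, for concreteness, a one-predictor least-squares fit as the model being validated (the function names and toy data are my own):

```python
import random

def fit_simple_ols(xs, ys):
    """Closed-form slope and intercept for one-predictor least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def kfold_mse(xs, ys, k=5, seed=0):
    """Average validation mean squared error over k random folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k roughly equal folds
    errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        b1, b0 = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        sq = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold]
        errors.append(sum(sq) / len(sq))
    return sum(errors) / k

# On noiseless data from y = 2x + 1, the cross-validated error is essentially zero.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
cv_err = kfold_mse(xs, ys, k=5)
```

Setting `k = len(xs)` in this sketch would give leave-one-out cross-validation.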
<extrainfo>
PRESS Statistic
The Prediction Sum of Squares (PRESS) statistic is related to leave-one-out cross-validation. It measures predictive ability by summing the squared prediction errors from LOOCV:
$$\text{PRESS} = \sum_{i=1}^{n} (y_i - \hat{y}_{-i})^2$$
where $\hat{y}_{-i}$ is the prediction for observation $i$ when the model was fitted without that observation.
Lower PRESS values indicate better predictive performance. A useful property: for linear models, PRESS can be computed efficiently without refitting $n$ times, making it less computationally costly than raw LOOCV.
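For the one-predictor case, the efficient shortcut uses the leverage values $h_{ii}$: each leave-one-out error equals the ordinary residual divided by $1 - h_{ii}$, so no refitting is needed. A sketch under that assumption (the function name is my own; general linear models would use the diagonal of the hat matrix instead):

```python
def press_simple_ols(xs, ys):
    """PRESS for one-predictor OLS without n refits:
    PRESS = sum((e_i / (1 - h_ii))**2), with h_ii = 1/n + (x_i - x̄)²/Sxx."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    press = 0.0
    for x, y in zip(xs, ys):
        resid = y - (intercept + slope * x)      # ordinary residual
        leverage = 1 / n + (x - mx) ** 2 / sxx   # hat-matrix diagonal entry
        press += (resid / (1 - leverage)) ** 2
    return press
```

Refitting the model $n$ times and summing the squared leave-one-out errors gives exactly the same number; the leverage formula just avoids the repeated fits.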
</extrainfo>
Feature Selection and Hyperparameter Optimization
Beyond choosing among pre-specified models, two algorithmic approaches help refine your model:
Feature Selection
Feature selection is the process of choosing which input variables to include in your model. Too many variables lead to overfitting and reduce interpretability; too few miss important information.
The motivation is practical: not all variables are useful. Some may be noise, some may be redundant (highly correlated with other variables), and some may lack predictive power.
Common approaches:
Filter methods: Rank variables by statistical tests (e.g., correlation with the outcome) and keep the top-ranked ones. Fast but ignores interactions between variables.
Wrapper methods: Try different subsets of variables, fit the model to each subset, and evaluate using cross-validation. More accurate but computationally expensive because you fit many models.
Embedded methods: The algorithm selects features as part of fitting (e.g., regularization-based methods like LASSO that automatically shrink some coefficients to zero). Computationally efficient and accounts for variable interactions.
Feature selection interacts with model selection: the best set of features depends on which model you choose, so in practice, feature selection and model selection often happen together.
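A filter method is the easiest of the three to sketch: rank each candidate variable by the absolute value of its correlation with the outcome and keep the top few. The function names and toy data below are my own:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_select(features, y, top_m=2):
    """Rank named feature columns by |correlation with y|; keep the top_m."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], y)),
                    reverse=True)
    return ranked[:top_m]

y = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "signal": [1.1, 2.0, 2.9, 4.2, 5.0],   # strongly related to y
    "noise":  [0.3, -1.2, 0.8, 0.1, -0.5], # essentially unrelated
}
chosen = filter_select(features, y, top_m=1)
```

Note the limitation mentioned above: this ranking looks at one variable at a time, so it can miss variables that matter only through interactions.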
Hyperparameter Optimization
Many machine learning algorithms have hyperparameters—settings you choose before fitting, such as the regularization strength, tree depth, or learning rate. These control the algorithm's behavior and strongly influence generalization.
Hyperparameter optimization adjusts these settings to minimize cross-validation error. Common approaches:
Grid search: Try a predefined grid of hyperparameter values (e.g., regularization strength = 0.01, 0.1, 1, 10), fit the model for each combination, and pick the best.
Random search: Sample random combinations of hyperparameter values, fit the model, and pick the best. Often more efficient than grid search.
Bayesian optimization: Use previous model evaluations to intelligently guess which hyperparameter values to try next, focusing on promising regions of the hyperparameter space.
The key principle: evaluate hyperparameter choices on validation data (via cross-validation), not on the training data.
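Grid search is straightforward to sketch once a model and a validation score are fixed. The example below tunes the penalty of a one-predictor ridge fit through the origin and scores each candidate on a held-out validation set; all names and data are my own, for illustration:

```python
def ridge_slope(xs, ys, lam):
    """One-predictor ridge fit through the origin: slope = Σxy / (Σx² + λ)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def grid_search(train, valid, grid):
    """Return the λ from grid whose fitted model has the lowest validation MSE."""
    (xtr, ytr), (xva, yva) = train, valid
    def val_mse(lam):
        b = ridge_slope(xtr, ytr, lam)
        return sum((y - b * x) ** 2 for x, y in zip(xva, yva)) / len(xva)
    return min(grid, key=val_mse)

xtr, ytr = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.9]   # roughly y = 2x
xva, yva = [5, 6], [10.0, 12.1]
# With this nearly noise-free data the unregularized fit wins the grid search.
best_lam = grid_search((xtr, ytr), (xva, yva), [0.0, 0.1, 1.0, 10.0])
```

In practice `val_mse` would average over cross-validation folds rather than a single split, but the selection logic is the same.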
Hypothesis Testing and Frequentist Approaches
A classical approach to model selection uses hypothesis tests to assess whether adding or removing variables significantly improves fit.
The likelihood-ratio test compares two nested models (one is a special case of the other) by comparing their likelihoods. If Model A has $k_A$ parameters and Model B has $k_B > k_A$ parameters (with Model A being a restricted version of Model B), the test statistic is:
$$\Lambda = -2 (\log L_A - \log L_B)$$
Under the null hypothesis that the simpler Model A is true, $\Lambda$ approximately follows a chi-squared distribution with $k_B - k_A$ degrees of freedom.
A small p-value suggests the added parameters in Model B are statistically significant—the data provides evidence that Model B is better.
Advantages: This approach is rigorous and tests a specific hypothesis.
Disadvantages: It only compares nested models (one must be a special case of the other), and it focuses on statistical significance rather than practical importance or generalization.
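The test statistic itself is a one-liner; comparing it against a chi-squared critical value gives the decision. A sketch with hard-coded 5% critical values for small degrees of freedom (the function names and the illustrative log-likelihoods are my own):

```python
def likelihood_ratio_stat(loglik_restricted, loglik_full):
    """Λ = -2 (log L_A - log L_B); larger values favor the fuller model."""
    return -2 * (loglik_restricted - loglik_full)

# 5% chi-squared critical values for 1-3 degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def reject_simpler_model(loglik_restricted, loglik_full, df):
    """True if the extra parameters are significant at the 5% level."""
    return likelihood_ratio_stat(loglik_restricted, loglik_full) > CHI2_CRIT_05[df]

# Adding one parameter raises log L from -502.5 to -500.0: Λ = 5.0 > 3.841,
# so the data favor the fuller model at the 5% level.
decision = reject_simpler_model(-502.5, -500.0, df=1)
```

A full implementation would compute an exact p-value from the chi-squared distribution rather than using a fixed critical-value table.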
<extrainfo>
Mallows's Cp
Mallows's $C_p$ is a classical criterion for selecting regression models. It's defined as:
$$C_p = \frac{\text{RSS}_p}{\hat{\sigma}^2} - n + 2p$$
where $\text{RSS}_p$ is the residual sum of squares for a model with $p$ parameters, $\hat{\sigma}^2$ is an estimate of error variance, and $n$ is sample size.
The logic: if the model is correct, $C_p \approx p$; if the model is biased (missing important variables), $C_p$ is large. You want to find models where $C_p$ is close to $p$.
Though conceptually important, Mallows's $C_p$ is less commonly used today because AIC and BIC are more general and have stronger theoretical foundations.
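The criterion is a direct translation of the formula. A sketch (the argument names and the illustrative numbers are my own):

```python
def mallows_cp(rss_p, sigma2_hat, n, p):
    """Mallows's criterion: RSS_p / σ̂² - n + 2p; values near p suggest little bias."""
    return rss_p / sigma2_hat - n + 2 * p

# If the p-parameter model is unbiased, RSS_p ≈ (n - p) σ², so the criterion ≈ p.
cp_good = mallows_cp(rss_p=97.0, sigma2_hat=1.0, n=100, p=3)    # 3.0, close to p
cp_biased = mallows_cp(rss_p=200.0, sigma2_hat=1.0, n=100, p=3) # 106.0, far above p
```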
Stepwise Regression
Stepwise regression iteratively adds or removes predictors based on an information criterion or hypothesis test.
In forward selection, start with no variables and add variables one at a time, choosing at each step the variable that most improves the criterion.
In backward selection, start with all variables and remove them one at a time, choosing at each step the variable whose removal minimizes the criterion (or harms fit least).
Advantage: Computationally feasible for large numbers of variables.
Disadvantage: Stepwise methods can miss good model subsets because they make greedy choices (locally optimal at each step) and don't explore the full space of subsets. Results are also sensitive to the specific stopping rule.
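The greedy forward loop can be sketched generically, with the criterion passed in as a scoring function. The toy criterion below is purely hypothetical: it pretends the score is minimized by exactly the subset {"x1", "x2"}, standing in for a real AIC computed from fitted models:

```python
def forward_selection(candidates, score, tol=0.0):
    """Greedy forward selection: start empty, repeatedly add the variable that
    most lowers score(subset); stop when no addition improves the score."""
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        remaining = [v for v in candidates if v not in selected]
        trials = [(score(selected + [v]), v) for v in remaining]
        if trials:
            s, v = min(trials)
            if s < best - tol:
                selected, best = selected + [v], s
                improved = True
    return selected

def toy_aic(subset):
    """Hypothetical criterion: fit worsens per missing useful variable,
    and each useless variable costs a small complexity penalty."""
    useful = {"x1", "x2"}
    misses = len(useful - set(subset))
    extras = len(set(subset) - useful)
    return 10.0 * misses + 2.0 * extras

chosen = forward_selection(["x1", "x2", "x3"], toy_aic)
```

Note the greediness: the loop never revisits an earlier choice, which is exactly why stepwise methods can miss good subsets.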
Bayes Factor
The Bayes factor compares two models from a Bayesian perspective by computing the ratio of their marginal likelihoods (the likelihood averaged over the prior distribution of parameters):
$$\text{BF} = \frac{p(\text{data}|\text{Model 1})}{p(\text{data}|\text{Model 2})}$$
A Bayes factor greater than 1 indicates evidence for Model 1; less than 1 indicates evidence for Model 2. Bayes factors are elegant from a theoretical perspective but require specifying priors over parameters, which introduces subjectivity.
</extrainfo>
Summary: Connecting the Pieces
A practical model selection workflow looks like this:
Define candidates: Use EDA and data transformation to specify a small set of candidate models in advance
Calculate information criteria: Compute AIC and/or BIC for each candidate
Validate: Use cross-validation to estimate generalization error for top contenders
Refine: Apply feature selection or hyperparameter optimization within your chosen model class if needed
Final assessment: Report both training and validation performance to confirm the model generalizes
Remember the central tension: a model that fits training data perfectly is often useless on new data. Information criteria and validation methods solve this by penalizing complexity or directly measuring generalization. The best model is rarely the most complex—it's the one that balances fit and simplicity.
Flashcards
What is the primary purpose of exploratory data analysis in the context of model building?
To examine data patterns, distributions, and relationships to suggest plausible model forms.
Which three components are defined during model specification before fitting data?
Functional form
Variables
Assumptions
Which two statistical tests are commonly used to assess how well models explain data?
The likelihood-ratio test and the chi-squared test.
How does the Akaike information criterion (AIC) penalize model complexity?
By adding twice the number of parameters to the negative log-likelihood.
How is the complexity penalty calculated in the Bayesian information criterion (BIC)?
The logarithm of the sample size multiplied by the number of parameters is added to the negative log-likelihood.
According to the minimum description length principle, which model should be selected?
The model that yields the shortest total encoding of the data and the model itself.
What two factors does structural risk minimization attempt to balance?
Empirical error and a complexity term derived from the capacity of the hypothesis space.
How are predictors added or removed in stepwise regression?
Sequentially, based on a chosen information criterion or statistical test.
What is the goal of the algorithmic approach known as feature selection?
To choose a subset of input variables to improve model performance and reduce complexity.
What theoretical foundation does statistical learning theory provide for evaluating algorithms?
It evaluates and compares algorithms based on capacity and generalization error.
Why is cross-validation considered a computationally intensive method?
Because it repeatedly splits data into training and validation sets to estimate predictive performance.
How does the prediction sum of squares (PRESS) statistic evaluate a model's predictive ability?
By summing squared prediction errors obtained specifically from leave-one-out cross-validation.
How does Mallows’s cp assess model quality regarding predictors?
It compares the residual sum of squares to an unbiased estimate of prediction error, penalizing models with many predictors.
How does the Bayes factor quantify evidence for one model over another?
By comparing their marginal likelihoods under a Bayesian framework.
In the frequentist paradigm, how is hypothesis testing used for model selection?
To evaluate whether adding or removing parameters significantly improves model fit.
Quiz
Model selection - Advanced Methods and Evaluation Quiz Question 1: How does the Akaike information criterion (AIC) penalize model complexity?
- By adding twice the number of parameters to the negative log‑likelihood (correct)
- By adding the logarithm of the sample size multiplied by the number of parameters
- By subtracting the number of parameters from the log‑likelihood
- By multiplying the number of parameters by the sample size
Model selection - Advanced Methods and Evaluation Quiz Question 2: Which statistical test is commonly used in hypothesis‑testing approaches to determine if adding a parameter significantly improves model fit?
- Likelihood‑ratio test (correct)
- t‑test for differences in means
- ANOVA for comparing group means
- Kolmogorov‑Smirnov test for distribution differences
Model selection - Advanced Methods and Evaluation Quiz Question 3: How does the Bayesian information criterion (BIC) penalize model complexity?
- Adds (log n) × k to the negative log‑likelihood (correct)
- Adds 2 × k to the negative log‑likelihood
- Subtracts the number of observations from the log‑likelihood
- Multiplies the log‑likelihood by the number of parameters
Model selection - Advanced Methods and Evaluation Quiz Question 4: What is a primary drawback of using cross‑validation compared with a single hold‑out test?
- It is computationally intensive (correct)
- It provides biased estimates of predictive performance
- It cannot be used with small data sets
- It requires the model to be linear
Model selection - Advanced Methods and Evaluation Quiz Question 5: A Bayes factor of 5 in favor of Model A over Model B indicates what?
- Stronger evidence supporting Model A (correct)
- Equal evidence for both models
- Evidence favoring Model B
- Insufficient data to compare the models
Model selection - Advanced Methods and Evaluation Quiz Question 6: Which statistical test compares the fit of two nested models by evaluating the difference in their log‑likelihoods?
- Likelihood‑ratio test (correct)
- Chi‑squared goodness‑of‑fit test
- t‑test
- ANOVA
Model selection - Advanced Methods and Evaluation Quiz Question 7: The PRESS statistic is calculated using which resampling technique?
- Leave‑one‑out cross‑validation (correct)
- K‑fold cross‑validation
- Bootstrap sampling
- Simple hold‑out validation
Model selection - Advanced Methods and Evaluation Quiz Question 8: In statistical learning theory, which concept quantifies a model’s expected performance on new, unseen data?
- Generalization error (correct)
- Training error
- Residual sum of squares
- Model capacity
Model selection - Advanced Methods and Evaluation Quiz Question 9: What does a low Mallows’s Cp value suggest about a candidate regression model?
- It likely has good fit with relatively few predictors (correct)
- It overfits the data with many predictors
- It has high bias and low variance
- It is unsuitable because its residual sum of squares is large
Model selection - Advanced Methods and Evaluation Quiz Question 10: Why is it common to apply standardization (zero mean, unit variance) to predictors before fitting a regularized regression model?
- To place all predictors on the same scale so the penalty treats them equally (correct)
- To increase the number of predictors by creating interaction terms
- To reduce the number of observations in the dataset
- To convert categorical variables into numeric form
Model selection - Advanced Methods and Evaluation Quiz Question 11: Which of the following visual tools is most commonly employed in exploratory data analysis to reveal the shape of a variable’s distribution?
- Histogram (correct)
- Confusion matrix
- Receiver operating characteristic (ROC) curve
- Student‑t test
Model selection - Advanced Methods and Evaluation Quiz Question 12: In stepwise regression, predictors are added or removed based on:
- A chosen information criterion or statistical test applied sequentially (correct)
- All predictors being regularized simultaneously (e.g., LASSO)
- Random selection without statistical guidance
- User intuition alone, without quantitative criteria
Model selection - Advanced Methods and Evaluation Quiz Question 13: According to the Minimum Description Length principle, the preferred model is the one that:
- Minimizes the combined length of encoding the model and the data (correct)
- Maximizes the likelihood of the observed data regardless of model size
- Uses the fewest parameters irrespective of fit
- Has the lowest training error among all candidates
Model selection - Advanced Methods and Evaluation Quiz Question 14: Which of the following is an example of a hyperparameter commonly tuned during hyperparameter optimization?
- Tree depth in a decision‑tree model (correct)
- Estimated regression coefficients
- Observed values of the response variable
- Number of records in the training dataset
Key Concepts
Model Selection Criteria
Akaike information criterion
Bayesian information criterion
Minimum description length
Mallows’s Cp
Bayes factor
Likelihood‑ratio test
Model Evaluation Techniques
Cross‑validation
Hyperparameter optimization
Feature selection
Structural risk minimization
Definitions
Akaike information criterion
A metric that balances model fit and complexity by adding twice the number of parameters to the negative log‑likelihood.
Bayesian information criterion
An information criterion that penalizes model complexity with a term equal to the log of the sample size times the number of parameters.
Minimum description length
A principle that selects the model yielding the shortest combined encoding of the data and the model itself.
Structural risk minimization
A learning framework that trades off empirical error against a complexity term derived from the hypothesis space capacity.
Cross‑validation
A resampling method that repeatedly partitions data into training and validation sets to estimate predictive performance.
Hyperparameter optimization
The process of tuning algorithm settings (e.g., regularization strength, tree depth) to achieve optimal model performance.
Feature selection
An algorithmic technique that chooses a subset of input variables to improve model accuracy and reduce complexity.
Bayes factor
A Bayesian statistic that compares the marginal likelihoods of two models to quantify evidence in favor of one over the other.
Mallows’s Cp
A criterion that assesses model quality by comparing residual sum of squares to an unbiased estimate of prediction error, penalizing excess predictors.
Likelihood‑ratio test
A hypothesis‑testing method that evaluates whether a more complex model provides a significantly better fit than a simpler one.