RemNote Community

Model selection - Advanced Methods and Evaluation

Understand advanced model selection methods, key information criteria, and validation/resampling techniques for evaluating models.

Summary

Model Selection: From Candidate Models to Best Fit

Introduction

When building a predictive or explanatory model, you face a fundamental challenge: how do you decide which model is best? Simply choosing the model that fits your data most closely is tempting, but it often leads to overfitting: a model that memorizes noise in your training data and performs poorly on new data. Model selection is the process of choosing among competing candidate models by balancing two competing goals: (1) fitting the data well, and (2) keeping the model simple. This guide walks you through the methods statisticians and machine learning practitioners use to make this choice rigorously. The workflow typically follows this pattern: first, define a set of candidate models; second, evaluate them using appropriate criteria; third, validate your choice on unseen data.

Defining Your Candidate Models

Before you can select the best model, you need a reasonable set of candidates to choose from. Three interconnected practices help you construct this set.

Data Transformation

Data transformation involves applying mathematical operations to your raw variables to make them more suitable for modeling. The goal is to make relationships clearer and meet the assumptions of your chosen modeling technique. Common transformations include:

- Logarithmic transformation: Converting a variable $x$ to $\log(x)$ is useful when data spans multiple orders of magnitude or when the relationship between variables is multiplicative rather than additive. For example, income distributions are often right-skewed, and $\log(\text{income})$ creates a more symmetric distribution.
- Scaling and standardization: Subtracting the mean and dividing by the standard deviation converts a variable to zero mean and unit variance. This is especially important when predictors have very different units or magnitudes, as it can improve numerical stability in algorithms and make coefficients comparable.
- Polynomial transformations: Creating variables like $x^2$ or $x^3$ allows your linear model to capture nonlinear relationships.

The motivation for transformation is practical: transformed variables often reveal patterns that raw variables hide, and they help satisfy assumptions about the distribution of errors.

Exploratory Data Analysis

Exploratory data analysis (EDA) is the systematic examination of your data before formal modeling. It involves computing summary statistics, creating visualizations, and investigating relationships between variables. Through EDA, you might discover that:

- A variable has a skewed or multimodal distribution, suggesting a transformation
- The relationship between two variables is nonlinear, suggesting polynomial terms or interaction terms
- Certain groups in your data behave differently, suggesting stratified models
- Important variables are missing or need to be engineered from raw data

For instance, if you plot the relationship between advertising spending and sales revenue and observe a curved pattern rather than a straight line, EDA suggests you should consider a quadratic model.

Model Specification

Model specification means explicitly defining the functional form, variables, and assumptions of each candidate model before you fit it to data. This step prevents p-hacking and data-driven model construction, where you tweak the model based on what fits best (which typically leads to overfitting). A properly specified model includes:

- The equation or formula (e.g., linear, logarithmic, polynomial)
- Which variables are included
- The assumed error distribution
- Any constraints or assumptions

For example, rather than trying dozens of combinations and picking the best, you might specify in advance: "I will fit three candidate models: (1) a simple linear regression, (2) a linear model with quadratic terms, and (3) a linear model with interaction terms."
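As a concrete sketch, the three transformation families above might look like this in Python (the income figures and the tiny predictor `x` are made up for illustration):

```python
import math

# Hypothetical right-skewed sample spanning several orders of magnitude
incomes = [2_000, 15_000, 40_000, 55_000, 1_200_000]

# Logarithmic transformation: compresses the long right tail
log_incomes = [math.log(v) for v in incomes]

# Standardization: subtract the mean, divide by the standard deviation,
# giving a variable with zero mean and unit variance
mean = sum(log_incomes) / len(log_incomes)
sd = math.sqrt(sum((v - mean) ** 2 for v in log_incomes) / len(log_incomes))
standardized = [(v - mean) / sd for v in log_incomes]

# Polynomial transformation: add an x^2 column so a linear model
# can capture curvature in the predictor
x = [0.5, 1.0, 1.5, 2.0]
design = [(xi, xi ** 2) for xi in x]  # columns: x and x^2
```

After standardizing, the transformed variable has mean 0 and variance 1 by construction, which is what makes coefficients comparable across predictors.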
Information Criteria: Balancing Fit and Complexity

Once you've defined candidate models, the central question becomes: how do you compare them fairly? A model with more parameters always fits the training data at least as well, but this doesn't mean it's the best choice. Information criteria solve this by penalizing model complexity. Each criterion adds a penalty term to the negative log-likelihood to discourage overfitting. The best model minimizes the information criterion: it balances fitting the data well against unnecessary complexity.

Akaike Information Criterion (AIC)

The Akaike Information Criterion is defined as:

$$\text{AIC} = 2k + (-2 \log L)$$

where $k$ is the number of parameters and $\log L$ is the log-likelihood (a measure of how well the model fits the data; higher is better). Breaking this down:

- The $(-2 \log L)$ term measures fit; smaller values indicate a better fit to the data
- The $2k$ term is the penalty for complexity; it increases linearly with the number of parameters

The motivation for AIC comes from information theory: Akaike showed that minimizing AIC approximately minimizes the expected prediction error on new data. In other words, the model with the lowest AIC tends to generalize best.

Key intuition: AIC says "I'll reward you for fitting the data well, but I'll penalize you 2 points for each additional parameter you add."

When to use AIC: AIC is widely used when your goal is prediction, and it tends to select slightly more complex models.

Bayesian Information Criterion (BIC)

The Bayesian Information Criterion is defined as:

$$\text{BIC} = k \log(n) + (-2 \log L)$$

where $n$ is the sample size. The difference from AIC is in the penalty term: BIC uses $k \log(n)$ instead of $2k$.

Why the difference matters: When the sample size $n$ is large, $\log(n) > 2$, so BIC penalizes complexity more heavily than AIC.
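Both criteria are one-liners once a model is fitted; here is a minimal sketch, assuming you already have each candidate's maximized log-likelihood, parameter count $k$, and sample size $n$:

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion: k log(n) - 2 log L (smaller is better)."""
    return k * math.log(n) - 2 * log_likelihood
```

For a model with $k = 3$ and $\log L = -500$, `aic(-500, 3)` gives 1006, and `bic(-500, 3, 100)` gives roughly 1013.8.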
This means:

- AIC tends to select more complex models
- BIC tends to select simpler models

The motivation for BIC comes from Bayesian statistics and model selection theory. BIC has strong theoretical properties for choosing the "true" underlying model when it exists in your candidate set.

When to use BIC: BIC is preferred when your goal is explanation or when you believe the true model is relatively simple. Because its penalty grows with $\log(n)$, it becomes increasingly conservative as the sample size grows.

A concrete example: Suppose you're comparing two models for predicting house prices. Model A (simple) has 3 parameters and $\log L = -500$. Model B (complex) has 10 parameters and $\log L = -490$. With $n = 100$:

- $\text{AIC}_A = 2(3) - 2(-500) = 6 + 1000 = 1006$
- $\text{AIC}_B = 2(10) - 2(-490) = 20 + 980 = 1000$, so AIC selects Model B
- $\text{BIC}_A = 3 \log(100) - 2(-500) = 13.8 + 1000 = 1013.8$
- $\text{BIC}_B = 10 \log(100) - 2(-490) = 46 + 980 = 1026$, so BIC selects Model A

Notice how BIC's larger penalty ($\log(100) \approx 4.6$ per parameter) makes it more conservative: it keeps the simpler model.

<extrainfo>
Minimum Description Length

The Minimum Description Length (MDL) principle takes a coding-theory perspective: the best model is the one that provides the shortest total encoding of both the data and the model itself. If you had to send the data and the model to someone else using as few bits as possible, MDL says you should choose whichever approach uses fewer bits. This principle is elegant but less commonly used in practice because calculating the optimal encoding is computationally challenging.

Structural Risk Minimization

Structural Risk Minimization (SRM) comes from statistical learning theory and balances empirical error (how well the model fits the training data) against a complexity term derived from the capacity of your hypothesis space. Rather than penalizing the number of parameters, SRM penalizes the "richness" of the set of functions your model can represent. Models that can express a wider variety of functions face larger penalties.
This is more sophisticated than AIC or BIC but requires more theoretical machinery to implement.
</extrainfo>

Validating Your Model: Resampling Approaches

Information criteria estimate how well a model will generalize, but validation directly measures it by evaluating performance on data the model hasn't seen. This is more reliable but more computationally expensive.

Cross-Validation

Cross-validation is the gold standard for estimating a model's predictive performance. The basic idea is deceptively simple: repeatedly divide your data into training and validation sets, fit the model on the training set, and evaluate it on the validation set. Average the results across all splits. The most common form is k-fold cross-validation:

1. Randomly divide your data into $k$ roughly equal-sized folds
2. For $i = 1$ to $k$: treat fold $i$ as the validation set, train the model on the other $k-1$ folds, and compute the prediction error on fold $i$
3. Average the $k$ prediction errors to get the cross-validation estimate

If $k = n$ (where $n$ is the sample size), this is called leave-one-out cross-validation (LOOCV): leave out one observation, fit the model, predict that observation, and repeat for all observations.

Why cross-validation matters:

- It directly estimates generalization error on new data
- It's more realistic than just checking fit to training data
- It accounts for variation across different data splits

Tradeoff: Cross-validation is computationally expensive (you fit the model $k$ times). With small $k$ (say 5), it's faster but slightly less accurate; with large $k$, it's more accurate but slower. $k = 5$ or $k = 10$ is typical.

<extrainfo>
PRESS Statistic

The Prediction Sum of Squares (PRESS) statistic is related to leave-one-out cross-validation.
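The k-fold procedure can be sketched in plain Python; the candidate model here is a simple least-squares line, an assumption made purely for illustration:

```python
import random

def kfold_cv_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error of a fitted line y = a + b*x."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)           # random assignment to folds
    folds = [idx[i::k] for i in range(k)]      # k roughly equal-sized folds
    fold_errors = []
    for i in range(k):
        held_out = set(folds[i])
        train = [j for j in idx if j not in held_out]
        # Fit the line by closed-form least squares on the training folds
        n = len(train)
        mx = sum(xs[j] for j in train) / n
        my = sum(ys[j] for j in train) / n
        b = (sum((xs[j] - mx) * (ys[j] - my) for j in train)
             / sum((xs[j] - mx) ** 2 for j in train))
        a = my - b * mx
        # Evaluate squared prediction error on the held-out fold
        mse = sum((ys[j] - (a + b * xs[j])) ** 2 for j in folds[i]) / len(folds[i])
        fold_errors.append(mse)
    return sum(fold_errors) / k

# Setting k = len(xs) turns this into leave-one-out cross-validation (LOOCV).
```

The same skeleton works for any model: only the fitting step in the middle changes.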
PRESS measures predictive ability by summing the squared prediction errors from LOOCV:

$$\text{PRESS} = \sum_{i=1}^{n} (y_i - \hat{y}_{-i})^2$$

where $\hat{y}_{-i}$ is the prediction for observation $i$ when the model was fitted without that observation. Lower PRESS values indicate better predictive performance. A useful property: for linear models, PRESS can be computed efficiently without refitting $n$ times, making it less computationally costly than raw LOOCV.
</extrainfo>

Feature Selection and Hyperparameter Optimization

Beyond choosing among pre-specified models, two algorithmic approaches help refine your model.

Feature Selection

Feature selection is the process of choosing which input variables to include in your model. Too many variables lead to overfitting and reduce interpretability; too few miss important information. The motivation is practical: not all variables are useful. Some may be noise, some may be redundant (highly correlated with other variables), and some may lack predictive power. Common approaches:

- Filter methods: Rank variables by statistical tests (e.g., correlation with the outcome) and keep the top-ranked ones. Fast, but ignores interactions between variables.
- Wrapper methods: Try different subsets of variables, fit the model to each subset, and evaluate using cross-validation. More accurate but computationally expensive because you fit many models.
- Embedded methods: The algorithm selects features as part of fitting (e.g., regularization-based methods like LASSO that automatically shrink some coefficients to zero). Computationally efficient and accounts for variable interactions.

Feature selection interacts with model selection: the best set of features depends on which model you choose, so in practice, feature selection and model selection often happen together.

Hyperparameter Optimization

Many machine learning algorithms have hyperparameters: settings you choose before fitting, such as the regularization strength, tree depth, or learning rate.
These control the algorithm's behavior and strongly influence generalization. Hyperparameter optimization adjusts these settings to minimize cross-validation error. Common approaches:

- Grid search: Try a predefined grid of hyperparameter values (e.g., regularization strength = 0.01, 0.1, 1, 10), fit the model for each combination, and pick the best.
- Random search: Sample random combinations of hyperparameter values, fit the model, and pick the best. Often more efficient than grid search.
- Bayesian optimization: Use previous model evaluations to intelligently guess which hyperparameter values to try next, focusing on promising regions of the hyperparameter space.

The key principle: evaluate hyperparameter choices on validation data (via cross-validation), not on the training data.

Hypothesis Testing and Frequentist Approaches

A classical approach to model selection uses hypothesis tests to assess whether adding or removing variables significantly improves fit. The likelihood-ratio test compares two nested models (one is a special case of the other) by comparing their likelihoods. If Model A has $k_A$ parameters and Model B has $k_B > k_A$ parameters (with Model A being a restricted version of Model B), the test statistic is:

$$\Lambda = -2 (\log L_A - \log L_B)$$

Under the null hypothesis that the simpler Model A is true, $\Lambda$ approximately follows a chi-squared distribution with $k_B - k_A$ degrees of freedom. A small p-value suggests the added parameters in Model B are statistically significant: the data provides evidence that Model B is better.

Advantages: This approach is rigorous and tests a specific hypothesis.

Disadvantages: It only compares nested models (one must be a special case of the other), and it focuses on statistical significance rather than practical importance or generalization.

<extrainfo>
Mallows's Cp

Mallows's $C_p$ is a classical criterion for selecting regression models.
It is defined as:

$$C_p = \frac{\text{RSS}_p}{\hat{\sigma}^2} - n + 2p$$

where $\text{RSS}_p$ is the residual sum of squares for a model with $p$ parameters, $\hat{\sigma}^2$ is an estimate of the error variance, and $n$ is the sample size. The logic: if the model is correct, $C_p \approx p$; if the model is biased (missing important variables), $C_p$ is large. You want to find models where $C_p$ is close to $p$. Though conceptually important, Mallows's $C_p$ is less commonly used today because AIC and BIC are more general and have stronger theoretical foundations.

Stepwise Regression

Stepwise regression iteratively adds or removes predictors based on an information criterion or hypothesis test.

- Forward selection: Start with no variables and add variables one at a time, choosing at each step the variable that most improves the criterion.
- Backward selection: Start with all variables and remove them one at a time, choosing at each step the variable whose removal minimizes the criterion (or harms fit least).

Advantage: Computationally feasible for large numbers of variables.

Disadvantage: Stepwise methods can miss good model subsets because they make greedy choices (locally optimal at each step) and don't explore the full space of subsets. Results are also sensitive to the specific stopping rule.

Bayes Factor

The Bayes factor compares two models from a Bayesian perspective by computing the ratio of their marginal likelihoods (the likelihood averaged over the prior distribution of the parameters):

$$\text{BF} = \frac{p(\text{data}|\text{Model 1})}{p(\text{data}|\text{Model 2})}$$

A Bayes factor greater than 1 indicates evidence for Model 1; less than 1 indicates evidence for Model 2. Bayes factors are elegant from a theoretical perspective but require specifying priors over parameters, which introduces subjectivity.
</extrainfo>

Summary: Connecting the Pieces

A practical model selection workflow looks like this:

1. Define candidates: Use EDA and data transformation to specify a small set of candidate models in advance
2. Calculate information criteria: Compute AIC and/or BIC for each candidate
3. Validate: Use cross-validation to estimate generalization error for top contenders
4. Refine: Apply feature selection or hyperparameter optimization within your chosen model class if needed
5. Final assessment: Report both training and validation performance to confirm the model generalizes

Remember the central tension: a model that fits training data perfectly is often useless on new data. Information criteria and validation methods solve this by penalizing complexity or directly measuring generalization. The best model is rarely the most complex; it's the one that balances fit and simplicity.
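As a toy sketch of the "Refine" step, here is cross-validated grid search over a single hyperparameter. The one-parameter ridge model (slope through the origin), the noiseless data, and the candidate grid are all assumptions made for illustration:

```python
import random

def ridge_slope(xs, ys, lam):
    """1-D ridge fit through the origin: b = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5, seed=0):
    """k-fold cross-validation mean squared error for a given penalty lam."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        held_out = set(folds[i])
        train = [j for j in idx if j not in held_out]
        b = ridge_slope([xs[j] for j in train], [ys[j] for j in train], lam)
        total += sum((ys[j] - b * xs[j]) ** 2 for j in folds[i]) / len(folds[i])
    return total / k

# Grid search: score every candidate penalty on validation folds, keep the best.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]                     # noiseless toy data, y = 2x
grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam))
```

Note that the hyperparameter is scored only on held-out folds, never on the data used to fit the slope; replacing the fixed grid with random draws of `lam` would turn this into random search.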
Flashcards
What is the primary purpose of exploratory data analysis in the context of model building?
To examine data patterns, distributions, and relationships to suggest plausible model forms.
Which three components are defined during model specification before fitting data?
Functional form, variables, and assumptions.
Which two statistical tests are commonly used to assess how well models explain data?
The likelihood-ratio test and the chi-squared test.
How does the Akaike information criterion (AIC) penalize model complexity?
By adding twice the number of parameters to the negative log-likelihood.
How is the complexity penalty calculated in the Bayesian information criterion (BIC)?
The logarithm of the sample size multiplied by the number of parameters is added to the negative log-likelihood.
According to the minimum description length principle, which model should be selected?
The model that yields the shortest total encoding of the data and the model itself.
What two factors does structural risk minimization attempt to balance?
Empirical error and a complexity term derived from the capacity of the hypothesis space.
How are predictors added or removed in stepwise regression?
Sequentially, based on a chosen information criterion or statistical test.
What is the goal of the algorithmic approach known as feature selection?
To choose a subset of input variables to improve model performance and reduce complexity.
What theoretical foundation does statistical learning theory provide for evaluating algorithms?
It evaluates and compares algorithms based on capacity and generalization error.
Why is cross-validation considered a computationally intensive method?
Because it repeatedly splits data into training and validation sets to estimate predictive performance.
How does the prediction sum of squares (PRESS) statistic evaluate a model's predictive ability?
By summing squared prediction errors obtained specifically from leave-one-out cross-validation.
How does Mallows’s Cp assess model quality regarding predictors?
It compares the residual sum of squares to an unbiased estimate of prediction error, penalizing models with many predictors.
How does the Bayes factor quantify evidence for one model over another?
By comparing their marginal likelihoods under a Bayesian framework.
In the frequentist paradigm, how is hypothesis testing used for model selection?
To evaluate whether adding or removing parameters significantly improves model fit.

Key Concepts
Model Selection Criteria
Akaike information criterion
Bayesian information criterion
Minimum description length
Mallows’s Cp
Bayes factor
Likelihood‑ratio test
Model Evaluation Techniques
Cross‑validation
Hyperparameter optimization
Feature selection
Structural risk minimization