Dimensionality Reduction Strategies
Understand the purposes and strategies of feature selection, the benefits of reduced dimensionality, and how feature projection techniques such as PCA transform data.
Summary
Feature Selection and Dimensionality Reduction
Introduction
When building machine learning models, you often work with datasets containing many features (input variables). However, not all features are equally useful for prediction. Feature selection and feature projection are two complementary approaches to reducing the number of features in your dataset. Understanding these techniques is essential because working with fewer, more informative features can lead to more accurate models, faster training times, and better interpretability.
Feature Selection Methods
What Is Feature Selection?
Feature selection is the process of identifying and selecting a subset of the most relevant features from the original set of input variables. Importantly, feature selection keeps the original features unchanged—it simply removes the less useful ones. This is different from feature projection, which transforms the features into a new space.
The key motivation for feature selection is that not all features contribute equally to the prediction task. Some features may be noisy, redundant, or irrelevant. By removing them, we can improve model performance and reduce computational costs.
The Three Main Feature Selection Strategies
Feature selection methods fall into three broad categories, each with different approaches to identifying which features to keep.
Filter Strategy
The filter strategy evaluates each feature independently, without considering any specific machine learning algorithm. Features are scored based on their intrinsic properties—typically how much information they provide about the target variable.
A common example is information gain, which measures how much knowing a feature reduces uncertainty about the target. Features with high information gain are more useful for prediction. The filter strategy works as follows:
Calculate a relevance score for each feature (such as information gain or correlation)
Rank the features by their scores
Select the top-k features or features that exceed a threshold
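The steps above can be sketched in plain Python. This is a minimal illustration assuming discrete feature values; the helper names `entropy` and `information_gain` are written from scratch here, not taken from any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, target):
    """Reduction in target entropy after splitting on a discrete feature."""
    total = entropy(target)
    n = len(target)
    # Group target values by the feature value they co-occur with.
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return total - remainder

# Toy dataset: feature A mirrors the target, feature B is unrelated.
target = [0, 0, 1, 1, 0, 1]
feat_a = [0, 0, 1, 1, 0, 1]   # identical to target -> maximal gain
feat_b = [0, 1, 0, 1, 0, 1]   # unrelated -> little gain

scores = {"A": information_gain(feat_a, target),
          "B": information_gain(feat_b, target)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # feature A ranks first
```

Note that no model is trained at any point: each feature is scored purely from its statistical relationship with the target, which is exactly why filter methods are fast.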
Advantages: Filter methods are computationally efficient because they don't require training machine learning models. They work quickly even on datasets with thousands of features.
Disadvantages: Filter methods ignore interactions between features and don't account for which algorithm you'll actually use for prediction. A feature might be useful when combined with another feature, but this interaction isn't captured.
Wrapper Strategy
The wrapper strategy takes a different approach: it treats feature selection as a search problem where different subsets of features are evaluated by actually training a machine learning algorithm and measuring its predictive accuracy.
The wrapper strategy works like this:
Start with a subset of features (either all features or none)
Train a model and evaluate its performance (accuracy, error, etc.)
Add or remove features and retrain
Continue until no improvement is found
Advantages: Because wrapper methods evaluate features in the context of an actual learning algorithm, they capture feature interactions and will find the features most useful for your specific prediction task and model.
Disadvantages: Wrapper methods are computationally expensive. If you have 100 features, testing all possible subsets requires evaluating $2^{100}$ combinations, which is infeasible. Thus, wrapper methods typically use heuristic search strategies (like forward selection or backward elimination) that don't guarantee the optimal subset.
The key insight: wrapper methods are more accurate but slower; filter methods are fast but may miss useful feature interactions.
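Greedy forward selection, one of the heuristic searches mentioned above, can be sketched as follows. This is an illustrative example, assuming NumPy least squares as the stand-in learning algorithm and a held-out split as the evaluation; the synthetic data and the `validation_mse` helper are not from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on features 0 and 2; features 1 and 3 are noise.
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

def validation_mse(features):
    """Train least squares on the chosen columns, score on held-out data."""
    if not features:
        return float(np.mean((y_val - y_train.mean()) ** 2))
    cols = list(features)
    w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
    return float(np.mean((y_val - X_val[:, cols] @ w) ** 2))

# Greedy forward selection: repeatedly add the single feature that most
# reduces the validation error; stop when no candidate helps.
selected, remaining = [], set(range(X.shape[1]))
best = validation_mse(selected)
while remaining:
    cand, score = min(((f, validation_mse(selected + [f])) for f in remaining),
                      key=lambda t: t[1])
    if score >= best:
        break  # no remaining feature improves the model
    selected.append(cand)
    remaining.remove(cand)
    best = score

print(sorted(selected))  # the informative features 0 and 2 are among the picks
```

Notice that every candidate evaluation trains a model from scratch, which is the source of the wrapper strategy's computational cost.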
Embedded Strategy
The embedded strategy integrates feature selection directly into the model building process. Features are added or removed while the model is being trained, based on prediction errors or model coefficients.
A classic example is regularization in linear models. L1 regularization (Lasso) can automatically shrink some feature coefficients to exactly zero, effectively removing those features during training. Tree-based models like decision trees naturally perform embedded feature selection by using only the most informative features to split nodes.
Advantages: Embedded methods are computationally cheaper than wrapper methods while still accounting for how features interact with the learning algorithm, and they are often built into standard algorithms.
Disadvantages: Embedded methods are specific to particular learning algorithms and may not transfer well if you switch models.
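To make the Lasso example concrete, here is a minimal sketch of L1-regularized regression solved by proximal gradient descent (ISTA) in NumPy. The solver, synthetic data, and `alpha` value are all illustrative assumptions, not a production implementation:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm: shrinks values toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=2000):
    """Minimize (1/2n)||y - Xw||^2 + alpha*||w||_1 by proximal gradient."""
    n, d = X.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Only features 0 and 3 actually drive the target.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.05 * rng.normal(size=100)

w = lasso_ista(X, y, alpha=0.1)
kept = [i for i, wi in enumerate(w) if abs(wi) > 1e-3]
print(kept)  # the L1 penalty drives the irrelevant coefficients to exactly zero
```

The selection happens inside training: coefficients of uninformative features are shrunk to exactly zero by the soft-thresholding step, so no separate search over feature subsets is needed.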
Why Feature Selection Works
Working with fewer, better-chosen features often produces more accurate models than using all original features. This happens for several reasons:
Reducing noise: Irrelevant features introduce noise that confuses the model
Improving generalization: Models trained on fewer relevant features often generalize better to new data
Computational efficiency: Training and predictions are faster with fewer features
Interpretability: Models are easier to understand when they use fewer features
Feature Projection Methods
What Is Feature Projection?
Feature projection (also called feature extraction) takes a different approach from feature selection. Instead of choosing a subset of original features, projection methods transform all the data from the original high-dimensional space into a new, lower-dimensional space.
The key difference: feature selection subsets features, while feature projection combines features.
For example, instead of selecting 5 features from 20, feature projection might create 5 new features that are mathematical combinations of all 20 original features. These new features are chosen to preserve the most important information in the data.
Principal Component Analysis: A Linear Projection Example
Principal Component Analysis (PCA) is the most commonly used linear projection technique. PCA finds new directions (called principal components) in the feature space that capture the most variance in the data.
Here's the intuition: imagine plotting 2D data on a scatter plot. PCA finds the direction along which the data spreads out the most. This direction becomes the first principal component. Then it finds a perpendicular direction with the second-most variance, and so on.
Why maximize variance? Variance indicates where the meaningful information lies. By projecting data onto high-variance directions, PCA preserves the structure and patterns in the data while using fewer dimensions.
For example, PCA might reduce 100 original features to 10 principal components while preserving 95% of the variance in the data. These 10 components can then be used for prediction.
Key property: PCA is a linear transformation—each new feature is a weighted linear combination of the original features.
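The construction above can be sketched in NumPy via an eigendecomposition of the sample covariance matrix. This is a textbook illustration, not a production implementation, and the synthetic data is an assumption for the example:

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components: the orthogonal
    directions of maximum variance in the centered data."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = Xc.T @ Xc / (len(X) - 1)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the top-k directions
    components = eigvecs[:, order]           # each column is a unit direction
    explained = eigvals[order] / eigvals.sum()
    # Each new feature is a linear combination of all original features.
    return Xc @ components, explained

rng = np.random.default_rng(2)
# 3-D data that mostly varies along one latent direction.
t = rng.normal(size=(300, 1))
X = t @ np.array([[2.0, 1.0, -1.0]]) + 0.1 * rng.normal(size=(300, 3))

Z, explained = pca(X, k=1)
print(Z.shape, round(float(explained[0]), 3))  # one component keeps most variance
```

The `explained` ratio is how statements like "10 components preserve 95% of the variance" are computed: each eigenvalue measures the variance captured along its component.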
<extrainfo>
Nonlinear Projection Methods
While PCA uses linear transformations, many modern dimensionality-reduction techniques use nonlinear transformations. Techniques like t-SNE, autoencoders, and manifold learning methods can be more effective for certain types of data where the relationship between features is highly nonlinear. However, linear methods like PCA remain the most commonly used in practice due to their simplicity and interpretability.
</extrainfo>
Summary: Feature Selection vs. Feature Projection
To solidify your understanding, here's how these approaches differ:
Feature Selection: Removes features. Keeps original features interpretable. Methods: filter, wrapper, embedded.
Feature Projection: Transforms all features into new combinations. Creates new derived features that may be harder to interpret but often more informative. Key example: PCA.
The choice between them depends on your goals. Use feature selection when you want interpretability and understanding of which original variables matter. Use feature projection when you want to maximize predictive accuracy and don't mind working with transformed features.
Flashcards
What is the primary goal of feature selection methods in data preprocessing?
To find a subset of the original input variables (features).
What are the three main strategies used for feature selection?
Filter strategy
Wrapper strategy
Embedded strategy
How does the filter strategy evaluate features?
Independently of any learning algorithm, using criteria like information gain.
What guide does the wrapper strategy use to search for a subset of features?
The predictive accuracy of a specific learning algorithm.
When are features added or removed in an embedded strategy?
While the model is being built, based on prediction errors.
How does feature projection (feature extraction) handle high-dimensional data?
It transforms the data from the high-dimensional space into a space with fewer dimensions.
Which linear projection technique aims to maximize the variance of the projected data?
Principal Component Analysis (PCA).
Quiz
Dimensionality Reduction Strategies Quiz Question 1: What does principal component analysis (PCA) aim to maximize when projecting data?
- The variance of the projected data (correct)
- The number of dimensions
- The distance between data points
- The classification accuracy
Dimensionality Reduction Strategies Quiz Question 2: What is the primary goal of feature selection methods?
- To find a subset of the original input variables (features) (correct)
- To increase the number of input variables for better model complexity
- To combine features into a single composite variable irrespective of relevance
- To randomly select features without evaluation
Dimensionality Reduction Strategies Quiz Question 3: What does feature projection achieve in dimensionality reduction?
- It transforms data to a lower‑dimensional space (correct)
- It increases dimensionality to capture more detail
- It eliminates all features leaving only the target variable
- It duplicates existing features to create redundancy
Dimensionality Reduction Strategies Quiz Question 4: Which of the following criteria is commonly used by the filter strategy to evaluate each feature independently of any learning algorithm?
- Information gain (correct)
- Model accuracy
- Cross‑validation error
- Regularization penalty
Dimensionality Reduction Strategies Quiz Question 5: A typical advantage of applying feature selection before building a regression or classification model is that the resulting model becomes:
- Simpler and more interpretable (correct)
- Larger and more complex
- Less robust to noise
- More dependent on specific algorithms
Dimensionality Reduction Strategies Quiz Question 6: What is true about many nonlinear dimensionality‑reduction techniques?
- They also perform feature projection (correct)
- They always require linear transformations
- They cannot preserve local structure
- They are equivalent to PCA
Dimensionality Reduction Strategies Quiz Question 7: What defines the embedded strategy for feature selection?
- Features are added or removed during model training based on prediction errors (correct)
- Features are pre‑selected before training using cross‑validation scores
- Features are chosen by exhaustive search of all possible subsets
- Features are selected solely based on their correlation with the target variable
Key Concepts
Dimensionality Reduction Techniques
Dimensionality reduction
Feature projection
Principal component analysis (PCA)
Nonlinear dimensionality reduction
Feature Selection Strategies
Feature selection
Filter strategy
Wrapper strategy
Embedded strategy
Definitions
Dimensionality reduction
Techniques that transform high‑dimensional data into a lower‑dimensional representation while preserving essential information.
Feature selection
The process of identifying a subset of original variables that contribute most to predictive performance.
Filter strategy
A feature‑selection approach that evaluates each variable independently of any learning algorithm, often using statistical criteria.
Wrapper strategy
A feature‑selection method that searches for optimal subsets by iteratively training a specific model and measuring its accuracy.
Embedded strategy
A feature‑selection technique that incorporates variable selection directly into the model‑training process.
Feature projection
The transformation of data into a new space with fewer dimensions, typically by combining original features.
Principal component analysis (PCA)
A linear projection method that re‑expresses data along orthogonal axes maximizing variance.
Nonlinear dimensionality reduction
A class of methods that map data to lower dimensions using nonlinear transformations, preserving complex structures.