Techniques and Applications of Dimensionality Reduction
Understand linear projection methods, nonlinear manifold learning techniques, and practical applications like preprocessing for nearest‑neighbor algorithms.
Summary
Linear Projection Techniques for Dimensionality Reduction
Introduction: Why Dimensionality Reduction Matters
Dimensionality reduction is a fundamental technique in machine learning and data analysis that transforms high-dimensional data into lower-dimensional representations while preserving important information. The motivation is twofold: first, to improve computational efficiency and reduce memory requirements, and second, to avoid the curse of dimensionality—the phenomenon where algorithms perform poorly in very high-dimensional spaces because distances between points become increasingly similar.
The key challenge is deciding what information to preserve. Different techniques preserve different types of information: some prioritize overall variance in the data, others focus on class separation, and still others preserve local neighborhood structure. Understanding these distinctions will help you choose the right technique for your problem.
Principal Component Analysis
Principal Component Analysis (PCA) is the most widely used linear dimensionality reduction technique. It works by constructing the covariance (or correlation) matrix of your data and computing its eigenvectors.
Here's the core insight: when you compute the eigenvalues and eigenvectors of the covariance matrix, the eigenvectors point along directions in your data space, and the corresponding eigenvalues tell you how much variance exists along each of those directions. The principal components are simply the eigenvectors ordered by their eigenvalues from largest to smallest.
Why is this useful? Because variance often corresponds to important structure in your data. The first principal component captures the direction of maximum variance in your data, the second component captures the maximum variance perpendicular to the first, and so on. By retaining only the top $k$ eigenvectors (those with the largest eigenvalues), you can project your $d$-dimensional data into a $k$-dimensional subspace where $k < d$, typically keeping 90-95% of the total variance while dramatically reducing dimensionality.
A particularly valuable property of PCA is that the first few components often represent large-scale physical behavior or important phenomena in your system because they contain the majority of the system's energy or information content.
Important note about PCA: The technique works by centering the data (subtracting the mean), which can be problematic for data that must remain non-negative, such as physical flux measurements. This is where other techniques become relevant.
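As a concrete illustration, here is a minimal sketch using scikit-learn's `PCA` (assuming scikit-learn is available). The synthetic data are constructed so that most variance lies in a 2-D subspace, so two components should retain nearly all of it:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples in 5 dimensions whose variance lives mostly in a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X += 0.05 * rng.normal(size=(200, 5))          # small isotropic noise

pca = PCA(n_components=2)                      # keep the top-2 eigenvectors
Z = pca.fit_transform(X)                       # internally centers the data

print(Z.shape)                                 # (200, 2)
print(pca.explained_variance_ratio_.sum())     # fraction of variance retained
```

The `explained_variance_ratio_` attribute is how you check, in practice, whether your chosen $k$ keeps the 90-95% of variance mentioned above.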
Linear Discriminant Analysis
While PCA is unsupervised (it ignores class labels and just maximizes variance), Linear Discriminant Analysis (LDA) is a supervised technique that uses class information to guide dimensionality reduction.
LDA finds linear combinations of features that best separate two or more classes of objects or events. Rather than maximizing overall variance, LDA maximizes the ratio of between-class variance to within-class variance. This means LDA specifically identifies directions that discriminate well between your classes.
The key difference from PCA: if two classes are well-separated along a high-variance direction, PCA will choose that direction. But if the classes are well-separated along a low-variance direction, LDA will prefer the low-variance direction because it's more useful for discrimination. This makes LDA particularly effective when you have labeled data and your goal is classification.
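A minimal sketch of supervised reduction with scikit-learn's `LinearDiscriminantAnalysis`, using the iris dataset as a stand-in for labeled data. Note that with $C$ classes, LDA yields at most $C - 1$ discriminant directions:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)              # 4 features, 3 classes

# with C = 3 classes, at most C - 1 = 2 discriminant directions exist
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)                    # labels guide the projection

print(Z.shape)                                 # (150, 2)
```

Unlike `PCA.fit_transform(X)`, the call takes `y` as well: the class labels are what define the between-class and within-class variance being traded off.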
Non-Negative Matrix Factorization
Non-Negative Matrix Factorization (NMF) decomposes a non-negative data matrix $X$ (where all entries are $\geq 0$) into the product of two non-negative matrices: $X \approx WH$, where $W$ and $H$ have no negative entries.
The practical advantage of NMF compared to PCA is that it preserves physically non-negative fluxes and quantities. Recall that PCA subtracts the mean from your data, and its components and projection coefficients carry mixed signs, so the factors can contain negative values even when your original measurements are fundamentally non-negative (like light intensities, chemical concentrations, or emission rates). NMF enforces the non-negativity constraint throughout, which is crucial for interpretability in applications like hyperspectral imaging, text analysis, or any domain where negative values are physically meaningless.
The lower-dimensional factors learned by NMF tend to be more interpretable and correspond to meaningful physical components in your data.
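A short sketch with scikit-learn's `NMF` on synthetic non-negative data (the `init` and `max_iter` settings here are just reasonable defaults, not requirements of the method):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 20))                      # strictly non-negative data

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                       # 100 x 5 factor, non-negative
H = nmf.components_                            # 5 x 20 factor, non-negative

# no mean subtraction anywhere: both factors stay >= 0
print(W.min() >= 0 and H.min() >= 0)           # True
```

The rows of `H` play the role of parts-based "components", and `W @ H` is the low-rank approximation of `X`.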
Nonlinear Projection Techniques
The Need for Nonlinear Methods
Linear techniques like PCA work well when your data's important structure is captured by linear combinations of original features. However, real-world data often has complex, nonlinear structure that linear projections cannot capture. Imagine data distributed along a curved surface in high-dimensional space—a linear projection might collapse this surface into a pancake, destroying the underlying structure.
Nonlinear projection techniques maintain the relevant geometric properties of your data by using either nonlinear transformations or graph-based approaches. The following techniques represent two main strategies: those using kernels (extending linear methods nonlinearly) and those using manifold learning (preserving local neighborhood properties).
Kernel Principal Component Analysis
Kernel Principal Component Analysis (KPCA) extends PCA through the kernel trick, a technique borrowed from support vector machines. Instead of working with the original features, KPCA implicitly maps data into a high-dimensional (possibly infinite-dimensional) space where linear relationships can capture nonlinear patterns in the original space.
The result: KPCA can construct nonlinear mappings that maximize variance, capturing curved structures that linear PCA would miss. The "kernel" is a function that measures similarity between points, and different choices of kernels (Gaussian, polynomial, etc.) capture different types of nonlinear relationships.
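A sketch with scikit-learn's `KernelPCA` on the classic two-concentric-circles dataset, where no linear projection can separate the classes but an RBF kernel can (the `gamma` value here is an illustrative choice, not a canonical one):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# two concentric circles: curved structure that linear PCA cannot unfold
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
Z = kpca.fit_transform(X)                      # nonlinear "principal components"

print(Z.shape)                                 # (300, 2)
```

Swapping `kernel="rbf"` for `"poly"` or another kernel changes which family of nonlinear relationships the method can capture.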
Manifold Learning: Philosophy and Core Concept
The family of manifold learning techniques approaches dimensionality reduction from a different philosophical angle. These methods assume your high-dimensional data actually lies on or near a lower-dimensional manifold (a curved, lower-dimensional surface). They construct low-dimensional representations that preserve local data properties using graph-based cost functions.
The key principle: maintain relationships among nearby points. If two points are close in the high-dimensional space, they should remain close in the low-dimensional embedding. This local preservation often reveals the underlying manifold structure better than global variance-based methods.
Isomap
Isomap preserves geodesic distances between data points. A geodesic distance is the shortest path along the manifold between two points (imagine walking along the surface of a rolled-up piece of paper, not cutting through the paper). Isomap constructs a nearest-neighbor graph, computes shortest paths along this graph (approximating geodesics), and then embeds the data to preserve these distances.
Why geodesic distances matter: two points might be far apart in straight-line distance but close in geodesic distance if the manifold curves between them. Isomap recognizes and respects this curved geometry.
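The rolled-up-paper picture corresponds exactly to the swiss-roll benchmark; a minimal sketch with scikit-learn's `Isomap` (neighborhood size is a tunable assumption):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# the swiss roll: a 2-D sheet curled up in 3-D space
X, t = make_swiss_roll(n_samples=500, random_state=0)

# nearest-neighbor graph -> shortest paths (geodesics) -> 2-D embedding
iso = Isomap(n_neighbors=10, n_components=2)
Z = iso.fit_transform(X)

print(Z.shape)                                 # (500, 2)
```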
Locally Linear Embedding
Locally Linear Embedding (LLE) takes a different approach to local preservation. Rather than preserving distances, LLE preserves relationships among each point and its nearest neighbors. Specifically, it finds weights that best reconstruct each point as a linear combination of its neighbors in the high-dimensional space, then applies these same weights to reconstruct points in the low-dimensional embedding.
The intuition: if point $i$ is roughly a weighted average of its neighbors, it should remain so in the lower-dimensional representation. This captures local structure without explicitly computing distances.
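The same swiss-roll data can be embedded with scikit-learn's `LocallyLinearEmbedding`; the neighbor count controls how "local" the preserved structure is:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=500, random_state=0)

# each point is reconstructed as a weighted combination of its 10 neighbors,
# and those weights are reused to place the points in 2-D
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Z = lle.fit_transform(X)

print(Z.shape)                                 # (500, 2)
```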
Laplacian Eigenmaps
Laplacian Eigenmaps use the graph Laplacian (a matrix derived from the nearest-neighbor graph) to preserve local neighborhood structure. Points that are close to each other in the original space are penalized if they become far apart in the embedding, encouraging the low-dimensional representation to maintain local clusters and neighborhoods.
Maximum Variance Unfolding
Maximum Variance Unfolding (MVU) aims to simultaneously preserve and enhance structure. It exactly preserves all pairwise distances between nearest neighbors while simultaneously maximizing distances between non-neighbors. This "unfolding" creates an embedding where the manifold structure becomes maximally spread out and visible.
<extrainfo>
Diffusion Maps
Diffusion Maps use diffusion distances derived from random walks on the data graph to embed the data. Rather than using Euclidean distances or geodesics, diffusion distance measures how well-connected two points are through the graph structure. Points that are connected through many short paths are considered close, which can reveal cluster structure.
Classical Multidimensional Scaling
Classical Multidimensional Scaling (MDS) preserves pairwise distances between all points (not just nearest neighbors). Interestingly, classical MDS is mathematically equivalent to PCA when applied to Euclidean distance matrices—it's the same solution expressed differently.
</extrainfo>
Visualizing Complex Relationships: t-SNE and UMAP
t-Distributed Stochastic Neighbor Embedding
t-Distributed Stochastic Neighbor Embedding (t-SNE) takes yet another approach: it minimizes the divergence between probability distributions over pairs of points. In other words, if two points are close in the high-dimensional space, they should be close with high probability in the low-dimensional embedding, and vice versa.
Critical caveat: t-SNE is designed primarily for data visualization. It excels at creating beautiful, interpretable 2D or 3D plots where clusters and structure are clearly visible. However, it is not recommended for clustering, outlier detection, or further algorithmic analysis because it does not preserve densities or distances reliably. The visual separation you see between clusters in a t-SNE plot may be artificial—created by the algorithm's optimization rather than reflecting true separation in the data. Always be skeptical of downstream analysis based on t-SNE embeddings.
Uniform Manifold Approximation and Projection
Uniform Manifold Approximation and Projection (UMAP) assumes the data are uniformly distributed on a locally connected Riemannian manifold (a smooth, curved surface) with an approximately constant metric. UMAP produces visualizations similar to t-SNE but with better theoretical foundation and more reliable preservation of both local and global structure. Like t-SNE, it's primarily useful for visualization.
Deep Learning Approach: Autoencoders
Autoencoders represent a neural network approach to dimensionality reduction. An autoencoder is a feedforward neural network with a bottleneck architecture: it compresses data through increasingly narrow hidden layers to a low-dimensional "code" layer (the bottleneck), then reconstructs the original data through increasingly wide layers.
The network learns two functions simultaneously: an encoder that maps high-dimensional input to low-dimensional representation, and a decoder that reconstructs the input from the low-dimensional code. The loss function is reconstruction error—how well the decoded output matches the original input.
Autoencoders can learn nonlinear dimension-reduction functions that capture complex structure. Deep autoencoders were historically trained in two stages: greedy layer-wise pre-training (often using restricted Boltzmann machines for each layer) followed by fine-tuning with backpropagation on the entire network. This staged approach helped avoid poor local minima; with modern initializations and optimizers, end-to-end training from scratch is also common.
The advantage of autoencoders: they naturally learn both encoding and decoding functions, making them useful both for compression and for generating new data. The disadvantage: they require more data and careful hyperparameter tuning compared to simpler methods.
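The encoder/decoder idea can be sketched in plain NumPy. For brevity this is a *linear* autoencoder (no activation functions) trained by gradient descent on reconstruction error; a linear autoencoder recovers the same subspace as PCA, but the training loop generalizes directly to nonlinear networks:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 200 samples in 8 dimensions that really live on a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))
X = (X - X.mean(0)) / X.std(0)                 # standardize columns

d, k = 8, 2                                    # input dim, bottleneck dim
W_enc = rng.normal(scale=0.1, size=(d, k))     # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))     # decoder weights
lr = 0.01

def recon_loss(X, W_enc, W_dec):
    R = X @ W_enc @ W_dec - X                  # reconstruction residual
    return (R ** 2).mean()                     # mean squared error

start = recon_loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc                              # encode to the bottleneck
    R = Z @ W_dec - X                          # decode and take the residual
    # gradients of the mean squared reconstruction error
    g_dec = Z.T @ R * (2 / X.size)
    g_enc = X.T @ (R @ W_dec.T) * (2 / X.size)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
end = recon_loss(X, W_enc, W_dec)

print(start, end)                              # loss decreases during training
```

Replacing the matrix products with layers that include nonlinearities (and computing gradients with an autodiff framework) yields the deep autoencoders described above.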
Practical Guidance and Related Concepts
When to Apply Dimensionality Reduction
A particularly important practical guideline: for datasets with more than ten dimensions, dimensionality reduction is often applied before using the k-nearest neighbors algorithm to avoid the curse of dimensionality. As dimension increases, the notion of distance becomes increasingly uniform—all points become roughly equidistant from each other—making nearest-neighbor queries unreliable. Reducing to 2-10 dimensions with techniques like PCA or UMAP can dramatically improve nearest-neighbor algorithm performance.
Background Concepts and Tools
Singular Value Decomposition
Singular Value Decomposition (SVD) factorizes a matrix into singular vectors and singular values. Mathematically, any matrix $X$ can be written as $X = U\Sigma V^T$, where $U$ and $V$ contain singular vectors and $\Sigma$ contains singular values. SVD forms the computational and theoretical basis for many linear reduction methods including PCA. Understanding SVD provides insight into why PCA works and how to compute it efficiently.
<extrainfo>
Johnson–Lindenstrauss Lemma
The Johnson–Lindenstrauss lemma provides a theoretical guarantee: you can randomly project $n$ points from high dimensions to roughly $\log(n)$ dimensions while preserving all pairwise distances within a small distortion. This theorem justifies random projection as a valid dimensionality reduction technique and provides bounds on how many dimensions you actually need.
Random Projection
Random Projection reduces dimensionality by projecting data onto a lower-dimensional subspace using a random matrix, preserving distances with high probability according to the Johnson–Lindenstrauss lemma. It's computationally very fast and surprisingly effective, though the random matrix approach is less interpretable than methods like PCA.
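A NumPy sketch of Gaussian random projection, checking the Johnson–Lindenstrauss promise empirically on synthetic points (the dimensions chosen are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, target = 30, 1000, 300
X = rng.normal(size=(n, d))                    # 30 points in 1000 dimensions

R = rng.normal(size=(d, target)) / np.sqrt(target)  # scaled Gaussian matrix
Y = X @ R                                      # project to 300 dimensions

def pairwise_dists(A):
    diff = A[:, None, :] - A[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    return D[np.triu_indices(len(A), k=1)]     # upper-triangle distances

ratio = pairwise_dists(Y) / pairwise_dists(X)
print(ratio.min(), ratio.max())                # both close to 1: low distortion
```

The `1 / sqrt(target)` scaling makes projected squared lengths unbiased estimates of the originals, which is what keeps the distance ratios near 1.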
Information Gain and Feature Selection
Information gain measures the reduction in entropy achieved by splitting a dataset on a particular feature, commonly used in filter-based feature selection for decision trees. While related to dimensionality reduction, feature selection typically keeps original features intact rather than creating new linear combinations, making it a complementary but distinct approach.
Latent Semantic Analysis
Latent Semantic Analysis (LSA) reduces dimensionality of text data by capturing underlying semantic structures using SVD on term-document matrices. It reveals that words with similar meanings have similar document distributions even when they don't appear together.
Sufficient Dimension Reduction
Sufficient Dimension Reduction seeks a lower-dimensional representation that preserves all information needed for a specific statistical inference (like prediction or classification). This is more targeted than PCA, focusing only on aspects of the data relevant to your specific task.
</extrainfo>
Quick Reference: Choosing a Technique
When selecting a dimensionality reduction method, consider these key questions:
Do you have labels? LDA for classification, PCA for unsupervised exploration
Is your data fundamentally non-negative? Consider NMF
Do you need the structure to be nonlinear? Choose KPCA, manifold learning, or autoencoders
Is visualization your primary goal? t-SNE or UMAP
Do you need fast computation? PCA or random projection
Do you need interpretable lower-dimensional features? PCA or NMF
Remember: no single technique is universally best. The choice depends on your data, your goal, and the specific structure you want to preserve.
Flashcards
Which two types of matrices are constructed in Principal Component Analysis to compute eigenvectors?
Covariance or correlation matrices.
Which eigenvectors are designated as the principal components in PCA?
Those corresponding to the largest eigenvalues.
What do the first few principal components in a PCA often represent in physical systems?
Large-scale physical behavior (containing the majority of the system's energy).
What is the primary goal of Linear Discriminant Analysis (LDA)?
To find a linear combination of features that best separates two or more classes.
How does Generalized Discriminant Analysis extend standard LDA to nonlinear problems?
By using kernel functions (similar to support-vector machines).
Into what components does Non-negative Matrix Factorization (NMF) decompose a data matrix?
The product of two non-negative matrices.
What is a key difference between NMF and PCA regarding data pre-processing?
NMF does not subtract the mean (preserving physical non-negative fluxes).
What technique does Kernel PCA use to construct nonlinear mappings that maximize variance?
The kernel trick.
What type of cost function is typically used by Manifold Learning techniques to preserve local data properties?
A graph-based cost function.
What specific type of distance does Isomap attempt to preserve between data points?
Geodesic distances.
How does Locally Linear Embedding (LLE) create a low-dimensional embedding?
By preserving relationships among each point and its nearest neighbors.
What mathematical construct do Laplacian Eigenmaps use to preserve local neighborhood structures?
The graph Laplacian.
What is the dual objective of Maximum Variance Unfolding (MVU) regarding distances?
Preserving pairwise distances between nearest neighbors while maximizing distances between non-neighbors.
Which linear dimensionality reduction technique is mathematically equivalent to Classical Multidimensional Scaling (MDS)?
Principal Component Analysis (PCA).
What mechanism is used to derive the distances for embedding in Diffusion Maps?
Random walks on the data graph.
What objective does t-SNE minimize to create an embedding?
The divergence between probability distributions over pairs of points.
For which two tasks is t-SNE generally not recommended due to unreliable density and distance preservation?
Clustering
Outlier detection
What architectural feature of an Autoencoder allows it to learn dimension-reduction functions?
A bottleneck hidden layer.
What are the two typical stages in training an Autoencoder?
Greedy layer-wise pre-training (e.g., using restricted Boltzmann machines)
Fine-tuning with back-propagation
What fundamental assumption does UMAP make about the distribution of data on a manifold?
Data are uniformly distributed on a locally connected Riemannian manifold.
At what dimensionality threshold is it recommended to apply reduction before using k-nearest neighbors?
More than ten dimensions.
What does Information Gain measure when splitting a dataset on a feature?
The reduction in entropy.
What does the Johnson–Lindenstrauss lemma provide a bound for?
The number of dimensions needed to preserve pairwise distances within a small distortion using random projection.
How does Random Projection achieve dimensionality reduction while preserving distances?
By projecting data onto a lower-dimensional subspace using a random matrix.
What is the goal of Sufficient Dimension Reduction (SDR)?
To find a representation that preserves all information needed for a specific statistical inference.
Quiz
Techniques and Applications of Dimensionality Reduction Quiz Question 1: Which nonlinear dimensionality reduction method explicitly aims to preserve geodesic distances between data points when creating a low‑dimensional embedding?
- Isomap (correct)
- Locally Linear Embedding
- t‑Distributed Stochastic Neighbor Embedding
- Kernel Principal Component Analysis
Techniques and Applications of Dimensionality Reduction Quiz Question 2: In Laplacian eigenmaps, what is the primary role of the graph Laplacian?
- Preserve local neighborhood relationships in the embedding (correct)
- Preserve global pairwise distances between all points
- Maximize variance of the low‑dimensional coordinates
- Minimize the reconstruction error of the original data
Techniques and Applications of Dimensionality Reduction Quiz Question 3: Which dimensionality‑reduction technique is mathematically equivalent to principal component analysis?
- Classical multidimensional scaling (correct)
- t‑Distributed stochastic neighbor embedding
- Kernel principal component analysis
- Isomap
Key Concepts
Linear Dimensionality Reduction
Principal Component Analysis
Linear Discriminant Analysis
Singular Value Decomposition
Random Projection
Johnson–Lindenstrauss Lemma
Sufficient Dimension Reduction
Nonlinear Dimensionality Reduction
Kernel Principal Component Analysis
Autoencoder
t‑Distributed Stochastic Neighbor Embedding
Uniform Manifold Approximation and Projection
Isomap
Non‑Negative Matrix Factorization
Definitions
Principal Component Analysis
A linear technique that projects data onto orthogonal axes capturing maximal variance via eigenvectors of the covariance matrix.
Linear Discriminant Analysis
A method that finds linear combinations of features that best separate predefined classes.
Non‑Negative Matrix Factorization
Decomposes a non‑negative data matrix into the product of two lower‑rank non‑negative matrices, preserving additive parts‑based structure.
Kernel Principal Component Analysis
Extends PCA to nonlinear mappings by applying the kernel trick to compute principal components in a high‑dimensional feature space.
Isomap
A manifold learning algorithm that preserves geodesic distances between points to embed high‑dimensional data in a lower‑dimensional space.
t‑Distributed Stochastic Neighbor Embedding
A visualization technique that minimizes the divergence between high‑dimensional and low‑dimensional pairwise similarity distributions.
Autoencoder
A neural network that learns to compress data into a low‑dimensional code and reconstruct it, providing nonlinear dimensionality reduction.
Uniform Manifold Approximation and Projection
Constructs a low‑dimensional representation assuming data lie on a uniformly distributed Riemannian manifold, yielding visualizations similar to t‑SNE.
Johnson–Lindenstrauss Lemma
States that a set of points can be projected into a lower‑dimensional space with only small distortion of pairwise distances using random linear maps.
Random Projection
Reduces dimensionality by projecting data onto a randomly generated subspace while approximately preserving distances.
Singular Value Decomposition
Factorizes a matrix into orthogonal singular vectors and singular values, forming the basis for many linear reduction methods.
Sufficient Dimension Reduction
Seeks a lower‑dimensional representation that retains all information needed for a specific statistical inference task.