RemNote Community

Study Guide

## 📖 Core Concepts

- **Dimensionality Reduction** – Transform high‑dimensional data into a lower‑dimensional space while keeping the data's essential structure (variance, distances, class separation).
- **Intrinsic Dimension** – The smallest number of variables needed to describe the underlying manifold of the data.
- **Linear vs. Nonlinear** – Linear methods (e.g., PCA) assume the data lie on a flat subspace; nonlinear methods (e.g., Isomap, t‑SNE) capture curved manifolds.
- **Feature Selection vs. Feature Extraction** – Selection: pick a subset of the original variables. Extraction: create new variables (projections) that are combinations of the originals.
- **Curse of Dimensionality** – As dimensions increase, data become sparse, distances lose meaning, and algorithms (e.g., k‑NN) become computationally heavy.

## 📌 Must Remember

- **PCA:** eigenvectors of the covariance matrix with the largest eigenvalues → principal components (PCs).
- **LDA:** finds linear combinations that maximize the ratio of between‑class variance to within‑class variance.
- **Kernel PCA:** applies PCA in a high‑dimensional feature space via a kernel function.
- **Isomap:** preserves geodesic distances (shortest paths on the data manifold).
- **t‑SNE:** minimizes the Kullback–Leibler divergence between high‑dimensional and low‑dimensional pairwise similarity distributions; great for visualization, poor for clustering.
- **Johnson–Lindenstrauss Lemma:** random projection into $k = O(\frac{\log n}{\varepsilon^2})$ dimensions preserves all pairwise distances within a factor of $(1\pm\varepsilon)$.
- **Random Projection:** multiply the data matrix $X$ by a random matrix $R$ (entries $\sim \mathcal{N}(0,1)$) → $X' = X R$. Preserves distances with high probability.
- **SVD:** $X = U\Sigma V^\top$; the columns of $U\Sigma$ give the optimal low‑rank approximation (the basis for PCA).

## 🔄 Key Processes

**PCA Workflow**
1. Center the data (subtract the mean).
2. Compute the covariance matrix $C = \frac{1}{N-1}X^\top X$.
3. Solve $C\mathbf{v} = \lambda \mathbf{v}$ for eigenvalues $\lambda$ and eigenvectors $\mathbf{v}$.
4. Sort the eigenvectors by descending $\lambda$ and keep the top $k$.
5. Project: $Z = X V_k$ (where $V_k$ contains the selected eigenvectors).

**Kernel PCA**
1. Choose a kernel $k(\mathbf{x},\mathbf{y})$ (e.g., RBF).
2. Compute the kernel matrix $K_{ij}=k(\mathbf{x}_i,\mathbf{x}_j)$.
3. Center $K$: $K_c = K - \mathbf{1}_N K - K\mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N$, where $\mathbf{1}_N$ is the $N\times N$ matrix with every entry $1/N$.
4. Eigen‑decompose $K_c$ → eigenvectors $\alpha$, eigenvalues $\lambda$.
5. Project a new point $\mathbf{x}$: $z = \sum_i \alpha_i \, k(\mathbf{x},\mathbf{x}_i) / \sqrt{\lambda}$.

**Isomap**
1. Build a $k$‑nearest‑neighbor graph (or $\epsilon$‑radius graph).
2. Compute shortest‑path (geodesic) distances $D_{ij}$ on the graph (Floyd–Warshall or Dijkstra).
3. Apply classical MDS to $D$ → low‑dimensional embedding.

**t‑SNE**
1. Compute high‑dimensional pairwise affinities $p_{ij}$ using Gaussian kernels.
2. Initialize the low‑dimensional points randomly.
3. Compute low‑dimensional affinities $q_{ij}$ using a Student‑t (heavy‑tailed) kernel.
4. Gradient descent on the KL divergence: $\displaystyle \frac{\partial C}{\partial y_i}=4\sum_j (p_{ij}-q_{ij})(y_i-y_j)\bigl(1+\|y_i-y_j\|^2\bigr)^{-1}$.

## 🔍 Key Comparisons

**PCA vs. LDA**
- Goal: PCA → capture variance; LDA → maximize class separation.
- Supervision: PCA is unsupervised; LDA is supervised (needs class labels).

**Linear vs. Nonlinear Projection**
- Linear: preserves global variance, easy to interpret (eigenvectors).
- Nonlinear: preserves local geometry (neighbors) but may distort global distances.

**t‑SNE vs. UMAP**
- t‑SNE: focuses on local probabilities, expensive $O(N^2)$, "crowding problem".
- UMAP: approximates the manifold with fuzzy simplicial sets, faster $O(N\log N)$, better at preserving both local and global structure.

**Random Projection vs. PCA**
- Random: computationally cheap, probabilistic guarantee (JL lemma).
- PCA: data‑driven, optimal variance capture, expensive eigen‑decomposition.

## ⚠️ Common Misunderstandings

**"PCA always improves classification."** PCA ignores class labels; it may discard discriminative features.

**"t‑SNE embeddings can be clustered directly."** t‑SNE distorts densities; apparent clusters may be artefacts.
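The first of these traps is easy to demonstrate. Below is a minimal NumPy sketch with made‑up data: the high‑variance feature is pure noise, while the low‑variance feature carries all the class information, so the top principal component points along the noise axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n // 2)
X = np.column_stack([
    rng.normal(0.0, 10.0, n),           # feature 0: high variance, non-discriminative noise
    np.where(labels == 0, -1.0, 1.0),   # feature 1: low variance, perfectly separates classes
])

# PCA by hand: center, form the covariance matrix, eigen-decompose.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
top_pc = eigvecs[:, -1]                 # direction of largest variance

# The top PC is dominated by the noisy feature, not the class feature.
print(abs(top_pc[0]) > abs(top_pc[1]))  # True: PCA picked the noise axis
```

Projecting onto this single component would discard exactly the feature a classifier needs, which is why PCA before classification is a judgment call, not a free win.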
**"More dimensions = better model."** Past a certain point, added dimensions contribute noise and hurt performance (curse of dimensionality).

**"Kernel PCA always beats linear PCA."** If the data are truly linear, the kernel adds unnecessary complexity and a risk of over‑fitting.

## 🧠 Mental Models / Intuition

- **PCA as a "best‑fit plane."** Imagine rotating a sheet of paper to line up with the direction that captures the most spread of the scattered points.
- **Manifold learning as "unfolding a crumpled map."** The data lie on a curved surface; the algorithms try to flatten it while keeping neighbor relationships intact.
- **Random projection as "shuffling a deck."** A random matrix mixes the coordinates, but the overall shape (pairwise distances) stays roughly the same thanks to the JL lemma.

## 🚩 Exceptions & Edge Cases

- **Non‑negative Matrix Factorization (NMF)** requires all data ≥ 0; it cannot be applied to centered (mean‑subtracted) data.
- **LDA** fails when classes are not linearly separable or when the within‑class scatter matrix is singular (e.g., more features than samples).
- **Isomap** breaks if the neighbor graph is not fully connected (isolated clusters → infinite geodesic distances).
- **t‑SNE's** perplexity parameter should be at most about $N/3$; too high a value loses local structure.

## 📍 When to Use Which

| Situation | Recommended Method | Reason |
|-----------|-------------------|--------|
| Goal: speed up k‑NN on >10‑D data | Random Projection or PCA (retain 90% variance) | Linear, cheap, preserves distances well enough for NN search. |
| Goal: visualize high‑D clusters | t‑SNE or UMAP (2‑D/3‑D) | Emphasizes local neighborhoods, produces readable plots. |
| Goal: supervised classification with labeled data | LDA (if classes are linearly separable) or kernel LDA | Directly maximizes class separation. |
| Goal: capture a nonlinear manifold | Isomap, LLE, Laplacian Eigenmaps, Diffusion Maps | Preserve geodesic or local neighborhood structure. |
| Goal: interpretability with non‑negative components | NMF | Produces an additive, parts‑based representation. |
| Goal: guarantee distance preservation with minimal computation | Random Projection (use the JL bound to pick $k$) | Probabilistic guarantee, no eigen‑decomposition. |
| Goal: compress text data | Latent Semantic Analysis (SVD on the term–document matrix) | Captures latent topics, reduces dimensionality. |

## 👀 Patterns to Recognize

- **Eigenvalue spectrum drop‑off** – A steep decline after the first few eigenvalues signals that a few PCs capture most of the variance → safe to truncate.
- **Disconnected neighbor graph** – In Isomap or LLE, a broken graph indicates $k$ is too small or the data are too sparse → increase the number of neighbors.
- **t‑SNE "crowding"** – Many points collapse into a single blob when the perplexity is too low; adjust the perplexity or learning rate.
- **Kernel trick appearance** – Whenever a method mentions a "kernel" (RBF, polynomial), expect the algorithm to work in an implicit high‑dimensional space without an explicit transformation.

## 🗂️ Exam Traps

- **"PCA always yields orthogonal features."** True for linear PCA; kernel PCA's projected components are not orthogonal in the original space.
- **"Higher perplexity in t‑SNE gives better global structure."** Perplexity balances local vs. global structure; too high a value can wash out local clusters.
- **"Random projection guarantees exact distances."** The JL lemma guarantees only approximate preservation, within $(1\pm\varepsilon)$.
- **"LDA can produce more dimensions than classes − 1."** LDA's maximum rank is $C-1$ (where $C$ is the number of classes).
- **"Manifold learning works on any data set."** It assumes the data lie on a low‑dimensional manifold; if that assumption fails, the embeddings are meaningless.

---

Keep this guide handy – a quick skim before the test will jog the core ideas, formulas, and decision rules you need to ace dimensionality‑reduction questions.
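As a quick numerical companion to the random‑projection entries above, here is a NumPy sketch (the dimensions $n$, $d$, and $k$ are chosen arbitrarily for illustration): project Gaussian data through a scaled random matrix and check that a pairwise distance survives roughly intact.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, k = 100, 1000, 300             # n points, original dim d, target dim k

X = rng.normal(size=(n, d))
# Random projection: Gaussian matrix scaled by 1/sqrt(k) so that
# squared distances are preserved in expectation (JL-style guarantee).
R = rng.normal(size=(d, k)) / np.sqrt(k)
Xp = X @ R

# Compare one pairwise distance before and after projection.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Xp[0] - Xp[1])
ratio = proj / orig
print(0.8 < ratio < 1.25)            # distortion stays within a modest factor
```

Note the preservation is approximate and probabilistic, exactly as the JL lemma states; shrinking $k$ widens the typical distortion, which is the trade‑off the "use the JL bound to pick $k$" advice refers to.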