Unsupervised learning - Core Techniques and Algorithms
Understand core unsupervised learning techniques such as Hebbian learning, self‑organizing maps, clustering, anomaly detection, and latent‑variable methods like EM and the method of moments.
Summary
Unsupervised Learning in Neural Networks
Introduction
Unsupervised learning represents a fundamentally different approach to training neural networks compared to supervised learning. Rather than learning to predict labeled targets, unsupervised models learn the underlying structure of data by finding patterns, reducing dimensionality, or identifying groups without explicit labels. This guide covers the classical principles and algorithms that form the foundation of modern unsupervised learning.
The diagram above illustrates how unsupervised learning tasks (imagine pictures, generate videos, model languages) differ from supervised tasks (recognize images, question & answer, analyze sentiments). Unsupervised methods work directly with raw data to discover its inherent organization.
Classical Foundations: Hebbian Learning
Hebbian learning is one of the oldest and most intuitive principles in neural networks. Proposed by Donald Hebb, it states that neurons that fire together wire together. In other words, when two neurons are active simultaneously, the connection between them should be strengthened.
Mathematically, a simplified Hebbian learning rule can be expressed as:
$$\Delta w_{ij} = \eta \cdot a_i \cdot a_j$$
where $\Delta w_{ij}$ is the change in weight between neurons $i$ and $j$, $\eta$ is a learning rate, and $a_i$ and $a_j$ are the activation values of the two neurons.
The key insight is that Hebbian learning is unsupervised—it requires no labeled data, only the natural correlations between neural activations. This makes it biologically plausible and computationally simple. However, Hebbian learning has limitations: weights can grow unbounded, and the learning is entirely local, based only on coincidence of firing.
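As a sketch, the rule above amounts to a single outer-product update. The layer sizes and activation values below are made up for illustration:

```python
import numpy as np

def hebbian_update(W, pre, post, eta=0.01):
    """One Hebbian step: Delta w_ij = eta * post_i * pre_j."""
    return W + eta * np.outer(post, pre)

rng = np.random.default_rng(0)
W = np.zeros((3, 4))      # weights from 4 presynaptic to 3 postsynaptic neurons
pre = rng.random(4)       # presynaptic activations
post = rng.random(3)      # postsynaptic activations
W = hebbian_update(W, pre, post)
```

Note that with non-negative activations, repeated application only ever increases the weights, which is exactly the unbounded-growth problem mentioned above; variants such as Oja's rule add a decay term to keep weights bounded.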
Self-Organizing Maps (SOMs)
The Self-Organizing Map is a classical neural network architecture that learns to create a topographic map of the input space. This means that the spatial organization of neurons in the map reflects the organization of features in the data—inputs with similar properties activate nearby neurons.
SOMs are particularly useful for visualization because they reduce high-dimensional data to a typically two-dimensional grid while preserving the neighborhood structure of the data. The key mechanism is:
Competitive learning: When an input arrives, the neuron whose weights are most similar to the input (the "winner") is selected.
Cooperative learning: Not only the winner, but also its neighbors in the map, have their weights updated to become more similar to the input.
This creates a self-organizing effect where the map gradually develops a structure that reflects the input distribution. Distant regions of the map come to represent dissimilar inputs, while nearby regions represent similar ones.
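The winner-plus-neighbors update can be sketched as follows. The grid size, learning rate, and Gaussian neighborhood width are illustrative choices, not canonical values:

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=20, eta=0.5, sigma=1.0, seed=0):
    """Minimal SOM: competitive winner selection + cooperative neighbor updates."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))
    # grid coordinates of each neuron, used to measure distance on the map
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    for _ in range(epochs):
        for x in data:
            # competitive step: find the best-matching unit (BMU)
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # cooperative step: Gaussian neighborhood around the BMU
            grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            influence = np.exp(-grid_d2 / (2 * sigma**2))
            weights += eta * influence[..., None] * (x - weights)
    return weights

data = np.random.default_rng(1).random((50, 2))
som = train_som(data)
```

Each input pulls an entire neighborhood of the map toward it, which is what gradually produces the topographic ordering.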
Training Objectives in Unsupervised Learning
Understanding Reconstruction Error
Unlike supervised learning, which minimizes prediction error on labeled targets, unsupervised neural networks typically minimize reconstruction error. The network learns to:
Encode input data into a compressed internal representation
Decode that representation back to reconstruct the original input
Minimize the difference between original and reconstructed data
This reconstruction objective forces the network to capture the essential structure of the data in its hidden representations.
Learning Algorithms for Unsupervised Networks
Several fundamental algorithms power unsupervised learning. Understanding these is crucial because they represent different philosophical approaches to learning model parameters:
Variational Inference
Variational inference approaches unsupervised learning from a Bayesian perspective. Rather than finding a single best set of parameters, it approximates the full probability distribution over possible parameters and latent variables. This is useful when you want to quantify uncertainty about what the model has learned.
Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation finds parameters that maximize the probability of observing the actual data. Mathematically, it seeks to maximize:
$$\mathcal{L}(\theta) = P(X | \theta)$$
where $X$ is the observed data and $\theta$ represents the model parameters. MLE is intuitive: it finds the parameters that make the observed data as probable as possible.
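For a Gaussian model the MLE has a closed form: the maximizing parameters are the sample mean and the (biased) sample variance. A quick numerical check, with arbitrary true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_mle = X.mean()                      # argmax over mu of the log-likelihood
var_mle = ((X - mu_mle) ** 2).mean()   # argmax over sigma^2 (biased: divides by n)
```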
Maximum A Posteriori (MAP)
Maximum a posteriori estimation extends MLE by incorporating a prior belief about parameters. Instead of just maximizing the likelihood of data, MAP maximizes:
$$P(\theta | X) \propto P(X | \theta) \cdot P(\theta)$$
This allows incorporating domain knowledge through the prior $P(\theta)$, making MAP a middle ground between pure Bayesian inference and point-estimate methods like MLE.
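For example, estimating a Gaussian mean with known variance under a Gaussian prior on the mean gives a MAP estimate that is a precision-weighted average of the prior mean and the sample mean. All the numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, mu0, tau2 = 4.0, 0.0, 1.0          # known noise variance, prior N(mu0, tau2)
X = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=20)

n = len(X)
mu_mle = X.mean()
# precision-weighted average of the prior mean and the sample mean
mu_map = (mu0 / tau2 + n * mu_mle / sigma2) / (1 / tau2 + n / sigma2)
```

With only 20 samples, the prior at $\mu_0 = 0$ pulls the MAP estimate noticeably below the MLE; as $n$ grows, the data dominate and MAP approaches MLE.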
Gibbs Sampling
Gibbs sampling is a Markov chain Monte Carlo technique used primarily with energy-based models (such as Boltzmann machines). Rather than finding a single parameter value, it draws samples from the posterior distribution by iteratively sampling each variable conditioned on the current values of all others. This is powerful for complex models where direct optimization is intractable.
Backpropagation of Reconstruction Error
In autoencoders and similar architectures, standard backpropagation is used to minimize reconstruction error. The error signal flows backward through the network: the difference between original and reconstructed input is computed, and gradients are used to update weights to minimize this error.
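A tiny linear autoencoder makes this concrete. This is a toy sketch with hand-derived gradients; the layer sizes, learning rate, and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing                    # data with true rank 2

W_enc = rng.normal(scale=0.1, size=(5, 2))   # encoder: 5 -> 2
W_dec = rng.normal(scale=0.1, size=(2, 5))   # decoder: 2 -> 5
eta = 0.05

def mse(X, W_enc, W_dec):
    return float(((X @ W_enc @ W_dec - X) ** 2).mean())

init_mse = mse(X, W_enc, W_dec)
for _ in range(2000):
    Z = X @ W_enc                      # encode
    err = Z @ W_dec - X                # reconstruction error
    grad_dec = Z.T @ err / len(X)      # gradient of MSE w.r.t. W_dec (up to a constant)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= eta * grad_dec
    W_enc -= eta * grad_enc
final_mse = mse(X, W_enc, W_dec)
```

Because the data really are rank 2, the 2-dimensional bottleneck can drive the reconstruction error far below its initial value.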
Clustering Algorithms
Clustering is the task of grouping similar data points together without labeled information. Several fundamental approaches exist:
K-Means Clustering
K-means is perhaps the most widely used clustering algorithm. It partitions data into exactly $k$ clusters by:
Initializing $k$ random cluster centers
Assigning each point to the nearest center
Updating each center to be the mean of its assigned points
Repeating until convergence
K-means minimizes within-cluster variance (the sum of squared distances from points to their cluster centers). The algorithm is simple and efficient, but requires you to specify $k$ in advance, and it can converge to local optima depending on initialization.
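The four steps above can be sketched directly in NumPy (Lloyd's algorithm); the two well-separated blobs are made-up data that keep the result easy to verify:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: assign points to nearest center, then recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest center for each point
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # blob around (0, 0)
               rng.normal(5, 0.5, (50, 2))])  # blob around (5, 5)
centers, labels = kmeans(X, k=2)
```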
Hierarchical Clustering
Hierarchical clustering builds a tree (dendrogram) of clusters by recursively merging or splitting clusters. There are two main variants:
Agglomerative: Start with each point as its own cluster, then repeatedly merge the two closest clusters
Divisive: Start with all points in one cluster, then recursively split clusters
Hierarchical clustering has the advantage that you don't need to specify the number of clusters—you can cut the tree at any level. However, it's computationally more expensive than K-means.
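A naive agglomerative sketch with single linkage (merge the two clusters whose closest members are nearest) illustrates the merging loop; it is cubic-time and purely illustrative:

```python
import numpy as np

def agglomerative(X, k):
    """Naive single-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(X))]             # start: every point alone
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)   # pairwise distances
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-linkage distance
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters[b]                      # merge b into a
        del clusters[b]
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])
groups = agglomerative(X, k=2)
```

Stopping at a chosen k here corresponds to cutting the dendrogram at one level; recording every merge instead would give the full tree.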
Mixture Models
Mixture models assume that data are generated from a mixture of probability distributions. For example, the data might come from a weighted combination of several Gaussian distributions:
$$P(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the weight of the $k$-th distribution, and $\mu_k$ and $\Sigma_k$ are its mean and covariance. Unlike K-means (which makes hard assignments), mixture models provide soft assignments—each point has a probability of belonging to each cluster.
Learning mixture models typically uses the Expectation-Maximization (EM) algorithm, discussed below.
Density-Based Clustering: DBSCAN and OPTICS
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as connected dense regions in the data space. It doesn't require specifying $k$ in advance, and it naturally identifies outliers as points in sparse regions. However, DBSCAN struggles when clusters have varying densities.
OPTICS (Ordering Points To Identify Clustering Structure) extends DBSCAN by ordering points in a way that reveals cluster structure at multiple density levels. Rather than finding a single clustering, OPTICS provides a hierarchical view of how cluster structure changes with density thresholds.
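A minimal DBSCAN sketch follows; the `eps` and `min_pts` values are illustrative, and a real implementation would use a spatial index rather than a full distance matrix:

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=4):
    """Minimal DBSCAN: label points in dense regions by cluster id, noise as -1."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cid                   # grow a new cluster from this core point
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cid
                if core[j]:               # only core points expand the cluster
                    frontier.extend(neighbors[j])
        cid += 1
    return labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (30, 2)),
               rng.normal(3, 0.2, (30, 2)),
               [[10.0, 10.0]]])           # one far-away outlier
labels = dbscan(X, eps=0.5, min_pts=4)
```

The isolated point never reaches the `min_pts` threshold and is never reachable from a core point, so it keeps the noise label -1.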
Anomaly Detection Methods
Anomaly detection identifies unusual or outlying points that don't fit the normal pattern of data.
Local Outlier Factor (LOF)
Local Outlier Factor measures how isolated a point is relative to its local neighbors. The key insight is that outliers are often points whose density is much lower than their neighbors' densities. LOF computes a score for each point based on the density of its $k$-nearest neighbors compared to its own density. Points with high LOF scores are anomalies.
Isolation Forest
Isolation Forest uses a different philosophy: anomalies are points that are easier to isolate than normal points. The algorithm recursively partitions the data space using random splits. Normal points require many splits to isolate, while anomalies can be isolated with fewer splits. This intuition makes Isolation Forest particularly efficient for high-dimensional data.
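The intuition can be sketched by measuring average isolation depth under random axis-aligned splits. This is a simplified sketch: the full algorithm also subsamples per tree and normalizes depths into an anomaly score:

```python
import numpy as np

def path_length(x, X, rng, depth=0, max_depth=10):
    """Depth at which random axis-aligned splits isolate point x within sample X."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    q = rng.integers(X.shape[1])              # pick a random feature
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)               # pick a random split value
    keep = X[:, q] < split if x[q] < split else X[:, q] >= split
    return path_length(x, X[keep], rng, depth + 1, max_depth)

def avg_path(x, X, n_trees=100, seed=0):
    """Average isolation depth over many random trees (lower = more anomalous)."""
    rng = np.random.default_rng(seed)
    return float(np.mean([path_length(x, X, rng) for _ in range(n_trees)]))

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
score_in = avg_path(np.zeros(2), X)             # a typical point: many splits needed
score_out = avg_path(np.array([8.0, 8.0]), X)   # a far-away point: isolated quickly
```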
Latent Variable Learning
Latent variable models assume that observed data $X$ was generated from some unobserved (latent) variables $Z$. Learning involves discovering what $Z$ must be and what parameters govern their relationship.
Expectation-Maximization (EM)
The Expectation-Maximization algorithm iteratively estimates latent variables and model parameters through two steps:
E-step (Expectation): Given current parameter estimates, compute the expected values of latent variables
M-step (Maximization): Given expected latent variables, optimize parameters to maximize likelihood
EM is guaranteed to improve the likelihood at each iteration, though it may converge to a local optimum. It's particularly valuable for mixture models, factor analysis, and other latent variable models where computing maximum likelihood directly is intractable.
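For a two-component 1-D Gaussian mixture, the two steps look like this; initializing the means at data quantiles is just one convenient choice:

```python
import numpy as np

def em_gmm(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    mu = np.quantile(x, [0.25, 0.75])    # spread the initial means across the data
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])
pi, mu, var = em_gmm(x)
```

The responsibilities computed in the E-step are exactly the soft assignments that distinguish mixture models from K-means.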
Method of Moments
The method of moments takes a fundamentally different approach. It relates unknown model parameters directly to statistical moments of the observed data:
The first-order moment is the mean vector: $\mathbb{E}[X]$
The second-order moment (when the data have zero mean) is the covariance matrix: $\mathbb{E}[XX^T]$
By relating model parameters to these moments, you can solve for parameters directly without iteration. This has a key advantage: method of moments avoids local optima—the solution is deterministic once you compute sample moments.
The tradeoff is that method of moments may be less statistically efficient (requiring more data) than maximum likelihood methods, and it requires that the model's parameters can be expressed in terms of moments.
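A classic example is the Gamma distribution: its mean is $k\theta$ and its variance is $k\theta^2$, so equating the first two sample moments with their theoretical values gives the parameters in closed form. The true shape and scale below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=50_000)

m1, var = x.mean(), x.var()
k_hat = m1**2 / var      # from mean = k*theta, variance = k*theta^2
theta_hat = var / m1     # solved directly -- no iteration, no local optima
```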
Blind Signal Separation Techniques
Blind signal separation refers to extracting independent signals from mixed observations without knowing the mixing process. Several techniques accomplish this:
Principal Component Analysis (PCA)
PCA finds directions of maximum variance in the data. It identifies a set of orthogonal directions (principal components) such that projecting data onto these directions maximizes variance. This is useful for dimensionality reduction and denoising.
Independent Component Analysis (ICA)
ICA extends PCA by finding components that are not just uncorrelated, but statistically independent. This is particularly powerful for problems like separating audio sources mixed in a room—the original sources are independent, even if their mixture is correlated.
Non-Negative Matrix Factorization (NMF)
NMF decomposes data into non-negative factors. This constraint is meaningful when data are naturally non-negative (like image pixel intensities or word counts) and you want interpretable components that can be meaningfully added together.
Singular Value Decomposition (SVD)
SVD decomposes a data matrix into three components: $X = U \Sigma V^T$. This provides the optimal low-rank approximation of the data and is the mathematical foundation for PCA.
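For instance, truncating the SVD gives the best rank-$k$ approximation in the least-squares sense (Eckart-Young), and the approximation error equals the norm of the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 10
X_k = U[:, :k] * S[:k] @ Vt[:k]   # keep only the k largest singular values
err = np.linalg.norm(X - X_k)     # Frobenius error = norm of discarded values
```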
<extrainfo>
Note on comparing EM and method of moments: While EM can become trapped in local optima and has no guarantee of finding true parameters, it often achieves better statistical efficiency and can be more flexible in handling complex models. Method of moments, conversely, provides a direct solution without iteration, avoiding local optima entirely, but may require stronger assumptions about what moments tell us about parameters.
</extrainfo>
Summary: Unsupervised learning encompasses a rich set of classical principles (Hebbian learning, SOMs) and modern algorithms (clustering, anomaly detection, latent variable learning) that discover structure in unlabeled data. These methods differ in their assumptions about data generation, computational efficiency, and the types of structure they uncover. Understanding when and how to apply each is essential for working with real-world data.
Flashcards
What is the fundamental principle of Hebbian learning regarding neurons?
Neurons that fire together wire together.
On what basis does Hebbian learning reinforce connections between neurons?
The coincidence of action potentials.
What kind of structure does a Self-Organizing Map (SOM) create to represent inputs with similar properties?
A topographic map.
By what metric do unsupervised networks typically adjust their weights?
Reconstruction error.
How does hierarchical clustering build its structure of clusters?
By recursive merging or splitting to build a tree.
By what mathematical criterion does K-means partition data into clusters?
Minimizing within-cluster variance.
What is the underlying assumption of mixture models regarding data generation?
Data are generated from a mixture of probability distributions.
How does DBSCAN distinguish between clusters and noise?
It identifies dense regions of points and separates sparse noise.
In what way does the OPTICS algorithm extend the capabilities of DBSCAN?
It orders points to reveal cluster structure at multiple density levels.
What does the Local Outlier Factor (LOF) measure to identify anomalies?
The degree to which a point is isolated relative to its neighbors.
Why does an Isolation Forest partition data recursively?
To isolate anomalies more quickly than normal points.
What two components does the expectation–maximization (EM) algorithm iteratively estimate?
Hidden variables and model parameters.
What is the first-order moment in the Method of Moments?
The mean vector.
When the mean is zero, what does the second-order moment represent?
The covariance matrix.
What is a significant limitation of the expectation–maximization algorithm compared to the Method of Moments?
It can get trapped in local optima and lacks guaranteed convergence to true parameters.
Quiz
Unsupervised learning - Core Techniques and Algorithms Quiz Question 1: What does Hebb’s principle assert about neuronal connections?
- Neurons that fire together wire together (correct)
- Neurons that fire separately disconnect
- Synaptic weights decrease with simultaneous firing
- Neural connections depend on external rewards
Question 2: What does the K‑means clustering algorithm seek to minimize?
- Within‑cluster variance (correct)
- Distance between cluster centroids
- Number of clusters defined
- Overall entropy of the data set
Question 3: In the method of moments, when the mean is zero, the second‑order moment corresponds to which matrix?
- The covariance matrix (correct)
- The mean vector
- A higher‑order tensor
- The eigenvalue matrix of the data
Question 4: What characteristic does a Self‑Organizing Map (SOM) preserve when mapping high‑dimensional inputs to a low‑dimensional grid?
- Topographic relationships between similar inputs (correct)
- Exact numeric values of each input dimension
- Temporal order of input presentation
- Probabilistic distribution of output labels
Question 5: Which technique is used in autoencoders to adjust network weights based on the difference between the input and its reconstruction?
- Backpropagation of reconstruction errors (correct)
- Stochastic gradient descent on classification loss
- Hebbian learning of co‑activations
- Reinforcement learning with reward signals
Question 6: What does the Local Outlier Factor (LOF) measure in anomaly detection?
- How isolated a point is compared to its neighbors (correct)
- The absolute distance from the origin
- The frequency of occurrence in the dataset
- The probability of belonging to a predefined class
Question 7: In the method of moments, unknown model parameters are related to what aspects of the observed data?
- Statistical moments of the data (correct)
- Raw pixel values
- Hyperparameters of the optimizer
- Labels of the data
Question 8: During training, what do unsupervised neural networks aim to mimic?
- The input data (correct)
- The desired output labels
- Random noise patterns
- A fixed weight matrix
Key Concepts
Neural Learning and Networks
Hebbian learning
Self‑organizing map
Clustering and Dimensionality Reduction
K‑means clustering
DBSCAN
Principal component analysis
Statistical Inference and Sampling
Variational inference
Gibbs sampling
Expectation–maximization algorithm
Isolation forest
Method of moments
Definitions
Hebbian learning
A neural learning rule stating that synaptic connections strengthen when the presynaptic and postsynaptic neurons fire together.
Self‑organizing map
An unsupervised neural network that projects high‑dimensional data onto a low‑dimensional grid preserving topological relationships.
Variational inference
An approximate Bayesian method that transforms inference into an optimization problem over a family of tractable distributions.
Gibbs sampling
A Markov chain Monte Carlo algorithm that generates samples from a joint distribution by iteratively sampling each variable conditioned on the others.
K‑means clustering
A partitioning algorithm that assigns data points to a fixed number of clusters by minimizing within‑cluster variance.
DBSCAN
A density‑based clustering method that groups points in high‑density regions while labeling low‑density points as noise.
Isolation forest
An ensemble anomaly‑detection technique that isolates outliers by recursively partitioning data using random splits.
Expectation–maximization algorithm
An iterative procedure that alternates between estimating hidden variables (E‑step) and maximizing parameters (M‑step) to find maximum‑likelihood estimates.
Principal component analysis
A linear dimensionality‑reduction technique that transforms data to orthogonal components capturing maximal variance.
Method of moments
A statistical approach that estimates model parameters by equating sample moments with theoretical moments of the distribution.