Introduction to Clustering
Understand the basics of clustering, key algorithms such as k‑means and hierarchical methods, and how to preprocess and evaluate clusters effectively.
Summary
Fundamentals of Clustering
Introduction
Clustering is an unsupervised machine learning technique that automatically discovers natural groupings within a dataset. Unlike supervised learning, where data points are labeled in advance, clustering works with unlabeled data and asks a fundamental question: which items are similar enough that they belong together?
The goal of clustering is to partition data into groups called clusters such that:
Points within the same cluster are close together (high similarity)
Points in different clusters are far apart (low similarity)
This technique is invaluable when you have a large dataset and want to understand its hidden structure without any pre-labeled categories. Common applications include customer segmentation, document organization, and gene grouping in biology.
How Similarity is Measured
Clustering algorithms determine which points are "close" using a distance function. The choice of distance metric is critical because it directly shapes the clusters you obtain.
The most common distance metric for numeric features is Euclidean distance. For two points $p$ and $q$ with numeric features, Euclidean distance is:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
This calculates the straight-line distance between points in multi-dimensional space—the same distance you'd measure with a ruler.
For binary attributes (features that are either 0 or 1), Hamming distance is typically used. It counts the number of positions where the two points differ. For example, if point $p = [1, 0, 1]$ and point $q = [1, 1, 1]$, their Hamming distance is 1 because they differ in exactly one position.
A critical point that often confuses students: different distance metrics can produce very different clusterings of the same data. The metric you choose embeds assumptions about what "similarity" means for your problem.
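To make these metrics concrete, here is a minimal Python sketch of both distance functions; the vectors are toy values chosen for illustration:

```python
from math import sqrt

def euclidean(p, q):
    # Straight-line distance between two numeric feature vectors.
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hamming(p, q):
    # Number of positions at which two binary vectors differ.
    return sum(a != b for a, b in zip(p, q))

print(euclidean([0, 0], [3, 4]))       # 5.0 (the classic 3-4-5 triangle)
print(hamming([1, 0, 1], [1, 1, 1]))   # 1, matching the example above
```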
Partitional versus Hierarchical Clustering
Clustering algorithms fall into two broad families:
Partitional methods directly split the data into a fixed number of clusters (you choose this number beforehand). These methods are typically faster and work well when you have a clear sense of how many clusters you want.
Hierarchical methods build a tree-like structure of clusters that you can examine and cut at any level. These methods are more flexible because you don't need to commit to a specific number of clusters upfront—you can cut the tree at different heights to get different numbers of clusters.
Partitional Clustering: The k-means Algorithm
The Algorithm and Process
The k-means algorithm is the most widely used partitional clustering method. The "k" refers to the number of clusters, which you must specify before running the algorithm.
Here are the core steps:
Initialize: Randomly place $k$ cluster centers (called centroids) in your data space.
Assign: Calculate the distance from each data point to each centroid. Assign every point to the nearest centroid based on your distance metric (usually Euclidean distance).
Update: For each cluster, calculate the mean (average) of all points assigned to it, and move the centroid to this new position.
Repeat: Go back to the assignment step. Keep iterating until cluster assignments stabilize—that is, points stop moving between clusters from one iteration to the next.
When the assignments no longer change, the algorithm has converged, and you have your final clusters.
Why this works: The algorithm is minimizing the total within-cluster variation. Each assignment step places every point with its nearest centroid, and each update step moves the centroids to better represent their assigned points. The process continues until it reaches a local optimum.
A conceptual point to keep clear: the centroid itself is usually not an actual data point—it's the mathematical average of all points in the cluster. If your data describes people with attributes like height (in cm) and weight (in kg), a centroid might have height = 175.3 cm and weight = 71.8 kg, even if no actual person has exactly those values.
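The four steps above can be sketched in a few lines of Python. This is an illustrative implementation, not a production one (real libraries add smarter initialization and multiple restarts); the two-blob dataset and the explicit `init` argument are assumptions added here for reproducibility:

```python
import random
from math import sqrt

def kmeans(points, k, init=None, max_iter=100):
    """Minimal k-means sketch. Returns (centroids, assignments)."""
    # Initialize: random data points as centroids unless `init` is given.
    centroids = list(init) if init else random.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assign: each point goes to its nearest centroid (Euclidean distance).
        new = [
            min(range(k),
                key=lambda j: sqrt(sum((a - b) ** 2
                                       for a, b in zip(p, centroids[j]))))
            for p in points
        ]
        if new == assignments:   # assignments stabilized: converged
            break
        assignments = new
        # Update: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments

# Two obvious blobs; initializing from one point in each for reproducibility.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
        (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(data, k=2, init=[data[0], data[3]])
print(labels)   # [0, 0, 0, 1, 1, 1] — the two blobs end up in separate clusters
```

Note that the returned centroids are the blob means, not data points, matching the conceptual point below.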
k-medoids: A Robust Variant
k-medoids is a variation of k-means that addresses a key weakness: extreme values (outliers). Instead of using the mean of each cluster as the center, k-medoids uses an actual data point from the cluster, called a medoid.
The advantage is robustness: because medoids must be real data points, they cannot be pulled toward outliers. If a cluster of customer incomes includes one billionaire, the medoid will still be a representative customer from the cluster, not a distorted average pulled toward the extreme value.
The trade-off is computation time—finding the best medoid is more expensive than calculating a mean. But when outliers are a concern, k-medoids often produces more interpretable and stable results.
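The billionaire example can be made concrete with a tiny sketch. The `medoid` helper and the income values are hypothetical, chosen only to show the mean/medoid contrast:

```python
def medoid(cluster, dist):
    # The actual data point minimizing total distance to all cluster members.
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

# Hypothetical cluster of customer incomes containing one billionaire outlier.
incomes = [[30_000], [32_000], [35_000], [40_000], [1_000_000_000]]
manhattan = lambda p, q: abs(p[0] - q[0])

mean_income = sum(x[0] for x in incomes) / len(incomes)
rep = medoid(incomes, manhattan)
print(mean_income)  # ~200 million: the mean is dragged toward the outlier
print(rep)          # [35000]: the medoid stays a representative customer
```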
Choosing the Number of Clusters
Choosing $k$ is one of the most challenging aspects of partitional clustering. Here are three practical techniques:
The Elbow Method: Run k-means for different values of $k$ (say, $k = 1$ through $k = 10$), and calculate the total within-cluster variation for each. Plot these values against $k$. The "elbow" is the point where the curve bends sharply—where adding more clusters produces diminishing improvements. This elbow suggests a good value of $k$. The reasoning is that you've captured the main structure, and additional clusters add little value.
Silhouette Scores: For each data point, calculate how similar it is to its own cluster compared to other clusters. A silhouette score ranges from -1 to 1, where higher values (closer to 1) mean the point is well-matched to its cluster. The average silhouette score for the entire clustering indicates overall quality. Compute this for different values of $k$ and choose the $k$ with the highest average silhouette score.
Gap Statistic: This method compares the observed within-cluster dispersion to what you'd expect if data were randomly distributed. The "gap" is the difference between these two values. The value of $k$ that maximizes this gap suggests you've found meaningful structure rather than arbitrary grouping.
Each method has strengths: the elbow method is simple and visual; silhouette scores directly measure cluster quality; gap statistics are theoretically grounded. In practice, trying multiple methods and seeing where they agree builds confidence in your choice.
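As an illustration of the silhouette idea, here is a small pure-Python sketch; the six 2-D points and the two candidate labelings are made up for demonstration:

```python
from math import sqrt

def euclid(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_silhouette(points, labels):
    """Average silhouette: for each point, a = mean distance to its own
    cluster, b = mean distance to the nearest other cluster, and
    s = (b - a) / max(a, b), which lies in [-1, 1]."""
    n = len(points)
    scores = []
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:                 # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(euclid(points[i], points[j]) for j in same) / len(same)
        b = float("inf")
        for c in set(labels):
            if c == labels[i]:
                continue
            members = [j for j in range(n) if labels[j] == c]
            b = min(b, sum(euclid(points[i], points[j])
                           for j in members) / len(members))
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = mean_silhouette(pts, [0, 0, 0, 1, 1, 1])   # tight, well-separated
bad = mean_silhouette(pts, [0, 1, 0, 1, 0, 1])    # blobs split across labels
print(good, bad)   # the correct labeling scores much higher
```

Running this for the labelings produced at each candidate $k$ and picking the highest average is exactly the silhouette-based selection described above.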
Hierarchical Clustering
Agglomerative Clustering (Bottom-Up)
Agglomerative clustering takes a bottom-up approach. It starts with each data point in its own singleton cluster, then repeatedly merges the two closest clusters until all points are in a single cluster.
The key steps:
Start: Treat every data point as its own cluster ($n$ clusters for $n$ points).
Merge: Find the two clusters that are closest to each other and merge them into one, reducing the cluster count by one.
Repeat: Calculate distances between clusters again and merge the closest pair. Continue until one cluster remains.
The concept of "closest clusters" requires a linkage criterion—a rule for measuring distance between clusters. Common choices include:
Single linkage: Distance between two clusters is the distance between their closest points.
Complete linkage: Distance between two clusters is the distance between their farthest points.
Average linkage: Distance is the average distance between all pairs of points (one from each cluster).
Different linkage criteria produce different hierarchical structures. Single linkage tends to produce long, chain-like clusters; complete linkage produces more compact clusters; average linkage is a middle ground.
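The three linkage criteria can be written directly from their definitions; the two small clusters below are illustrative:

```python
from math import sqrt

def euclid(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(a, b):
    # Distance between the closest pair of points, one from each cluster.
    return min(euclid(p, q) for p in a for q in b)

def complete_linkage(a, b):
    # Distance between the farthest pair of points.
    return max(euclid(p, q) for p in a for q in b)

def average_linkage(a, b):
    # Average distance over all cross-cluster pairs.
    return sum(euclid(p, q) for p in a for q in b) / (len(a) * len(b))

c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(c1, c2))     # 3.0
print(complete_linkage(c1, c2))   # ~5.39
print(average_linkage(c1, c2))    # between the two, as expected
```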
Divisive Clustering (Top-Down)
Divisive clustering is the opposite approach. It starts with all data points in one large cluster, then repeatedly splits the most heterogeneous (most spread out) cluster into two sub-clusters.
The steps:
Start: All data points in one cluster.
Split: Identify the cluster with the highest internal variation or heterogeneity. Partition this cluster into two sub-clusters.
Repeat: Continue splitting until each point is in its own singleton cluster, or until some stopping criterion is met.
Divisive clustering is conceptually straightforward but computationally more expensive than agglomerative clustering, so it is less commonly used in practice. However, it can sometimes produce better results because early splitting decisions are based on the complete dataset rather than on progressively smaller subsets.
Dendrograms: Visualizing the Hierarchy
A dendrogram is a tree diagram that visualizes the sequence of merges (in agglomerative) or splits (in divisive) during hierarchical clustering.
Here's how to read a dendrogram:
Vertical axis: Represents the distance or dissimilarity between clusters being merged.
Horizontal axis: Represents the data points (or clusters).
Branches: Show which points or clusters are combined.
Height of merge: The vertical position where two branches meet indicates the distance at which those clusters were merged. Higher merges represent joining clusters that are more dissimilar.
By examining a dendrogram, you can see natural gaps—points where many merges happen at low heights (points are very similar) followed by a large jump to the next merge (clusters are now quite different). These gaps suggest natural clustering structures.
Cutting the Dendrogram to Extract Clusters
One key advantage of hierarchical clustering is flexibility in choosing the number of clusters. Instead of pre-specifying $k$, you cut the dendrogram at a chosen height to partition the tree into clusters.
For example, if you cut at a low height, you get many small clusters (each highly similar internally). If you cut at a higher height, you get fewer, larger clusters (less homogeneous but broader groupings). By cutting at different heights, you can explore cluster structures at multiple levels of granularity.
The height you choose can be based on:
Domain knowledge (you know the data should form 3-5 groups)
Visual inspection (you look for large vertical gaps in the dendrogram)
Statistical criteria (similar to those used for k-means)
This flexibility is powerful when you're exploring unfamiliar data and don't have strong prior beliefs about the number of clusters.
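Cutting at a height can be emulated by running agglomerative merging and stopping once the closest remaining pair of clusters is farther apart than the cut height. The sketch below assumes single linkage; the five points are made up so that cuts at different heights yield different cluster counts:

```python
from math import sqrt

def euclid(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_at_height(points, height):
    """Single-linkage agglomeration, stopped at the cut height: the flat
    clustering you would get by cutting the dendrogram at that height."""
    clusters = [[p] for p in points]      # start: one singleton per point
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under single linkage.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclid(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > height:                    # every remaining merge is above the cut
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

data = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.5, 5.0), (20.0, 20.0)]
print(len(cluster_at_height(data, 1.0)))    # 3: low cut, many small clusters
print(len(cluster_at_height(data, 10.0)))   # 2: higher cut, broader groups
print(len(cluster_at_height(data, 25.0)))   # 1: everything merged
```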
Practical Considerations
Feature Scaling
Distance calculations are extremely sensitive to the scale of features. Consider this example: suppose you're clustering customers by age (ranging from 18 to 80) and annual spending (ranging from $5,000 to $500,000). In Euclidean distance calculations, differences in spending completely dominate differences in age because spending values are so much larger. The algorithm essentially ignores age and clusters based almost entirely on spending.
This is rarely what you want. To prevent any single feature from dominating, you must scale your features before clustering.
Standardization (also called z-score normalization) transforms each feature to have zero mean and unit variance:
$$z_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ is the feature's mean and $\sigma$ is its standard deviation. After standardization, all features contribute equally to distance calculations.
Normalization rescales each feature to a fixed range, such as $[0, 1]$:
$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
Both approaches equalize feature scales. Standardization is more common and is preferred when features follow a normal distribution. Normalization is useful when you need values in a specific range or when you want to bound outliers.
Important practical tip: Always scale your data before clustering. This simple step often has a larger impact on your results than the choice of clustering algorithm itself.
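Both transformations are a few lines of Python. The age and spending values below mirror the customer example and are purely illustrative:

```python
def standardize(values):
    # z-score: subtract the mean, divide by the standard deviation.
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mu) / sigma for v in values]

def normalize(values):
    # Min-max rescaling to the range [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 30, 45, 80]                        # small numeric range
spending = [5_000, 20_000, 80_000, 500_000]    # huge numeric range
z_ages, z_spending = standardize(ages), standardize(spending)
# After standardization both features span a few units around zero,
# so neither dominates a Euclidean distance calculation.
print(z_ages)
print(normalize(spending))
```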
Evaluating Cluster Quality
After running a clustering algorithm, you should evaluate whether the result makes sense. Internal validation metrics assess the quality of the clustering from the data and cluster assignments alone, without needing ground-truth labels.
Silhouette scores (mentioned earlier for choosing $k$) also work as a validation metric. A high average silhouette score suggests tight, well-separated clusters. A low score suggests a loose clustering in which some points might fit better in other clusters.
If ground-truth labels become available later (perhaps through manual inspection or domain expertise), you can perform external validation by comparing your clusters to the true labels. The adjusted Rand index is one common measure—it quantifies the agreement between predicted and true cluster assignments, adjusted for chance agreement.
The distinction between internal and external validation matters: internal metrics evaluate whether clusters are compact and separated in your data space, while external metrics evaluate whether the clusters match some external truth about what the "right" answer is.
Interpreting Cluster Centers and Representatives
Once you have clusters, you need to understand what they represent.
For k-means, the cluster centroid summarizes the average attribute values of all points in that cluster. If you're clustering customers and a centroid has age = 35, income = $75,000, and purchase frequency = 8, this tells you the "typical" customer in that cluster.
For k-medoids, the medoid is an actual data point. This is often easier to interpret: "This cluster is best represented by Customer ID #4729, who is 35 years old, makes $75,000 annually, and purchases 8 times per year." You can then examine this representative customer and others near it to understand the cluster's characteristics.
The ability to interpret cluster centers and find representative examples is crucial for actionable insights. A cluster is only useful if you can understand and explain what makes it distinct.
Flashcards
What is the primary definition of clustering in machine learning?
An unsupervised technique that discovers natural groupings in a set of data points.
How does clustering handle data labels during the grouping process?
It works with unlabeled data and does not use pre-assigned labels.
What are the two main goals of clustering regarding point proximity?
Points within the same cluster are close to each other (high similarity).
Points in different clusters are far apart (low similarity).
What is the most common way to measure similarity between data points?
A distance function.
Which distance measure is typically used for numeric features?
Euclidean distance.
Which distance measure is typically used for binary attributes?
Hamming distance.
What are the two main families of clustering algorithms?
Partitional methods
Hierarchical methods
How do partitional methods divide data?
They directly split the data into a fixed number $k$ of clusters.
What must a user specify before running a partitional algorithm?
The number of clusters $k$.
What are the iterative steps of the $k$-means algorithm?
Initialize $k$ centroids (often randomly).
Assign each data point to the nearest centroid.
Update each centroid to be the average (mean) of its assigned points.
Repeat assignment and update until assignments no longer change.
What does $k$-medoids use as cluster centers instead of calculated means?
Actual data points from the set.
Why is $k$-medoids considered more robust to outliers than $k$-means?
Because a medoid (actual point) cannot be pulled toward extreme values like a mean can.
How does the elbow method help select the number of clusters $k$?
It looks for a point where the improvement in total within-cluster variation slows down.
What do silhouette scores measure to evaluate a chosen $k$?
How similar each point is to its own cluster compared to other clusters.
How do gap statistics determine the optimal number of clusters $k$?
By comparing observed within-cluster dispersion to a reference null distribution and maximizing the gap.
What kind of structure do hierarchical methods build?
A tree-like structure of clusters.
How does the agglomerative (bottom-up) procedure operate?
Starts with every data point in its own singleton cluster.
Merges the two closest clusters at each step.
Continues until all points are in a single cluster.
How does the divisive (top-down) procedure operate?
Starts with all data points in one large cluster.
Splits the most heterogeneous cluster into two sub-clusters at each step.
Continues until each point is in its own cluster or a stopping criterion is met.
What does the height of a linkage in a dendrogram represent?
The distance between the merged clusters.
How can a user obtain a specific number of clusters from a dendrogram?
By cutting the dendrogram at a chosen height.
Why is feature scaling critical for clustering algorithms?
Because distance calculations are sensitive to scale; unscaled features can dominate the outcome.
What is the result of standardizing a numeric feature?
The feature has a mean of zero and unit variance.
What is the result of normalizing a numeric feature?
The feature is rescaled to a fixed range, such as $[0, 1]$.
What is the purpose of internal validation metrics like silhouette scores?
To quantify how well points fit within their assigned clusters without using external labels.
Which measure is commonly used for external validation when ground-truth labels are available?
Adjusted Rand index.
Quiz
Introduction to Clustering Quiz Question 1: During the k‑means algorithm, how is each centroid updated after assigning points?
- Take the arithmetic mean of all points assigned to that centroid (correct)
- Select the point with the smallest total distance to other points in the cluster
- Choose a random point from the cluster as the new centroid
- Use the median of each coordinate among the assigned points
Introduction to Clustering Quiz Question 2: In a dendrogram, what does the height at which two clusters merge represent?
- The distance (dissimilarity) between the merged clusters (correct)
- The total number of points in the combined cluster
- The computational time required for that merge step
- The similarity of feature scaling between the clusters
Introduction to Clustering Quiz Question 3: What effect does standardizing a numeric feature have?
- It centers the data to mean 0 and scales it to unit variance (correct)
- It rescales the data to lie within the range [0, 1]
- It applies a logarithmic transformation to the values
- It leaves the original values unchanged
Introduction to Clustering Quiz Question 4: Before running a partitional clustering algorithm, what must the user specify?
- The number of clusters $k$ (correct)
- The distance metric to use
- The initial positions of centroids
- The scaling method for features
Introduction to Clustering Quiz Question 5: Which of the following best defines clustering?
- It is an unsupervised technique that finds natural groupings in data. (correct)
- It is a supervised method that predicts labels for new data.
- It reduces the dimensionality of data by projecting onto principal components.
- It optimizes a reward function through interaction with an environment.
Introduction to Clustering Quiz Question 6: Which internal validation metric quantifies how well points fit within their assigned clusters?
- Silhouette score (correct)
- Adjusted Rand index
- Confusion matrix
- Precision‑recall curve
Introduction to Clustering Quiz Question 7: In partitional clustering methods, what does a cluster centroid represent?
- The average attribute values of the points in the cluster (correct)
- The most frequently occurring data point in the cluster
- The point with the smallest total distance to all other points
- The median value of each attribute across the cluster
Key Concepts
Clustering Techniques
Clustering
k-means clustering
k-medoids
Hierarchical clustering
Distance Metrics
Euclidean distance
Hamming distance
Clustering Evaluation and Selection
Dendrogram
Elbow method
Silhouette coefficient
Gap statistic
Feature scaling
Standardization (statistics)
Definitions
Clustering
An unsupervised machine‑learning technique that groups unlabeled data points into subsets of high internal similarity and low external similarity.
k-means clustering
A partitional algorithm that iteratively assigns points to the nearest of k centroids and updates centroids to the mean of their assigned points.
k-medoids
A variant of k‑means that selects actual data points as cluster representatives, making it more robust to outliers.
Hierarchical clustering
A family of algorithms that build a tree‑like structure of nested clusters either by successive merging (agglomerative) or splitting (divisive).
Dendrogram
A tree diagram that visualizes the sequence of merges or splits in hierarchical clustering, with branch heights indicating inter‑cluster distances.
Euclidean distance
The straight‑line metric used to measure similarity between numeric feature vectors in many clustering methods.
Hamming distance
A metric that counts differing positions between binary strings, used for similarity assessment of categorical or binary data.
Elbow method
A technique for selecting the number of clusters k by locating the point where the reduction in within‑cluster variance sharply slows.
Silhouette coefficient
A validation measure that compares a point’s average intra‑cluster distance to its nearest‑cluster distance, indicating clustering quality.
Gap statistic
A method that chooses k by comparing observed within‑cluster dispersion to that expected under a reference null distribution.
Feature scaling
The preprocessing step of adjusting variable ranges (e.g., via standardization or normalization) to ensure fair distance calculations.
Standardization (statistics)
Transforming numeric features to have zero mean and unit variance, a common scaling approach for clustering algorithms.