import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
"ignore") warnings.filterwarnings(
Clustering
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
We seek a partition of the data into distinct groups or clusters so that the observations within each group are quite similar to each other.
- To make this concrete, we must define what it means for two or more observations to be similar or different.
- Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.
This is an unsupervised problem because we are trying to discover structure—in this case, distinct clusters—on the basis of a data set.
Two Clustering Methods
- K-means clustering: Here we seek to partition the observations into a pre-specified number of clusters.
- Hierarchical Clustering: Here we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.
Let us generate some data on which we can try different clustering algorithms. We want to generate 600 data points in a two-dimensional space, gathered into three fairly well-separated clusters.
Code
from sklearn.datasets import make_blobs

plt.rcParams['font.size'] = 12

# Define cluster centers and standard deviations
centers = [[4, 2], [1, 7], [5, 6]]
cluster_std = [0.8, 0.5, 0.7]

# Generate synthetic data
X, labels = make_blobs(n_samples=600, centers=centers, cluster_std=cluster_std, random_state=23)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='magma')
plt.show()
Code
pd.DataFrame(X).head()
|   | 0 | 1 |
|---|---|---|
| 0 | 3.666363 | 0.322203 |
| 1 | 4.142521 | 5.438357 |
| 2 | 1.268230 | 6.913186 |
| 3 | 4.031165 | 2.311944 |
| 4 | 4.023620 | 1.255792 |
k-Means Clustering
K-means is a simple algorithm that partitions a dataset into K distinct, non-overlapping clusters. It is an iterative estimation method.
Main Idea: The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.
We minimize an objective function: the total within-cluster variation. To do so, we first need to define the within-cluster variation; the most common choice is based on squared Euclidean distance.
To perform K-means clustering, we must first pre-specify the desired number of clusters K; then the K-means algorithm will assign each observation to exactly one of the K clusters.
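To make the objective concrete, here is a minimal sketch of the quantity K-means drives down: the total squared Euclidean distance of each observation to its cluster mean. This differs from the average-pairwise-distance form of the within-cluster variation only by a factor of two per cluster, so minimizing one is equivalent to minimizing the other. The helper name `within_cluster_variation` is just for illustration, and the sketch reuses the `X` and `labels` arrays generated above.

```python
import numpy as np

def within_cluster_variation(X, cluster_labels):
    """Sum over clusters of squared Euclidean distances to the cluster mean.

    Minimizing this total is equivalent to minimizing the within-cluster
    variation defined through average pairwise squared distances, since the
    two differ only by a factor of two per cluster.
    """
    total = 0.0
    for k in np.unique(cluster_labels):
        points = X[cluster_labels == k]
        centroid = points.mean(axis=0)
        total += ((points - centroid) ** 2).sum()
    return total

# Evaluated on the generating labels of the synthetic data above
print(within_cluster_variation(X, labels))
```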
Applying k-means with scikit-learn is quite straightforward. Here, we apply it to the synthetic data that we used for the preceding plots. We instantiate the KMeans class and set the number of clusters we are looking for, k=3. Then we call the fit method with the data:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Build k-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
KMeans(n_clusters=3, random_state=0)
During the algorithm, each training data point in X is assigned a cluster label. You can find these labels in the kmeans.labels_ attribute:
labels = kmeans.labels_
pd.DataFrame(labels, columns=['cluster']).head()
|   | cluster |
|---|---------|
| 0 | 1 |
| 1 | 0 |
| 2 | 2 |
| 3 | 1 |
| 4 | 1 |
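Note that the integer labels produced by K-means are arbitrary: the cluster numbered 1 by the algorithm need not correspond to the component numbered 1 by make_blobs. A minimal sketch of how to check the correspondence with a contingency table follows; the variable `blob_labels` is introduced only for this comparison, and we regenerate the blob labels because the `labels` variable was overwritten above.

```python
# Regenerate the generating labels with the same call as before,
# since `labels` now holds the K-means assignment.
_, blob_labels = make_blobs(n_samples=600, centers=centers,
                            cluster_std=cluster_std, random_state=23)

# Rows: generating blob, columns: K-means cluster id. A table that is
# diagonal up to a permutation of the columns means the three blobs
# were recovered exactly.
pd.crosstab(blob_labels, kmeans.labels_, rownames=['blob'], colnames=['kmeans'])
```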
Let’s plot our results:
Code
# Figure
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='magma')

# Centers
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            facecolors='grey', marker="D", edgecolors='white', linewidth=1.5, s=50);
Number of Clusters
Silhouette Coefficient
The silhouette coefficient measures, for each observation, how close it is to the other points in its own cluster compared with the points in the nearest neighboring cluster; values close to 1 indicate well-separated clusters, while values near 0 indicate overlapping clusters. The per-observation silhouette widths are returned by the silhouette_samples function from the sklearn.metrics module, and their mean is returned by silhouette_score, which we use below. By default, scikit-learn sets the metric to 'euclidean', but this can be changed through the metric parameter:
from sklearn import metrics

# Silhouette
silhouette = metrics.silhouette_score(X, labels, metric='euclidean')
print('Silhouette Score =', silhouette.round(7))
Silhouette Score = 0.6974909
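As mentioned above, silhouette_samples returns the silhouette width of every individual observation; silhouette_score is simply their mean. A minimal sketch (the name `sample_widths` is just for illustration):

```python
from sklearn.metrics import silhouette_samples

# Per-observation silhouette widths; their mean reproduces the score above.
sample_widths = silhouette_samples(X, labels, metric='euclidean')
print(sample_widths[:5].round(3))
print('Mean width =', sample_widths.mean().round(7))
```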
Code
from sklearn import metrics

# Figure
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='magma')

# Centers
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            facecolors='grey', marker="D", edgecolors='white', linewidth=1.5, s=50);

# Silhouette
silhouette = metrics.silhouette_score(X, labels, metric='euclidean')
plt.title(f'Silhouette Score = {silhouette:.2f}')
Let’s run the k-means algorithm for different values of k and calculate the corresponding silhouette scores:
from sklearn.cluster import KMeans
from sklearn import metrics

# Set up the loop and plot
fig, axes = plt.subplots(3, 3, figsize=(8, 8))
axes = axes.reshape(-1)
k_values = range(2, 11)

for k, ax in zip(k_values, axes):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
    labels = kmeans.labels_

    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='magma')
    ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
               facecolors='grey', marker="D", edgecolors='white', linewidth=1.5, s=50)

    silhouette = metrics.silhouette_score(X, labels, metric='euclidean')
    ax.set_title(f'k = {k}, Silhouette = {silhouette:.2f}')
    ax.axis('off')

fig.tight_layout()
plt.show()
In any case, the optimal number of clusters is k = 3.
Elbow Criterion
To select the optimal number of clusters for k-means clustering, we can use the “elbow” method. The elbow method runs k-means clustering on the dataset for a range of values of k (say, from 2 to 10) and, for each value of k, computes an average score across all clusters.
When these overall metrics for each model are plotted, it is possible to visually determine the best value for k. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best value of k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good indication that the underlying model fits best at that point.
Let’s first compute the mean silhouette coefficient:
from sklearn.cluster import KMeans
from sklearn import metrics

k_values = range(2, 11)

# Store silhouette scores
silhouette_scores = []

# Iterate through each k
for k in k_values:
    # Fit KMeans model
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X)

    # Get cluster labels
    labels = kmeans.labels_

    # Calculate silhouette score
    silhouette = metrics.silhouette_score(X, labels, metric='euclidean')

    # Append silhouette score to list
    silhouette_scores.append(silhouette)

# Plotting
plt.plot(k_values, silhouette_scores, marker='o', color='#44355b')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Different Numbers of Clusters')
plt.xticks(k_values)
plt.grid(True)

# Find elbow point
diffs = np.diff(silhouette_scores)
elbow_point = k_values[np.argmax(diffs) + 1]
plt.axvline(x=elbow_point, color='grey', linestyle='--', label=f'Elbow Point ({elbow_point} clusters)')

plt.legend()
plt.show()
Instead of using the silhouette score, we can use the sum of squared errors, which the kmeans.score() method returns (as a negative value):
Code
from sklearn.cluster import KMeans

# Range of cluster numbers to try
k_values = range(2, 11)

# Store squared errors
squared_errors = []

# Iterate through each k
for k in k_values:
    # Fit KMeans model
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X)

    # Calculate squared error (inertia)
    squared_error = kmeans.score(X)

    # Append squared error to list
    squared_errors.append(-squared_error)  # Invert sign for plotting, as KMeans score returns a negative value

# Plotting
plt.plot(k_values, squared_errors, marker='o', c='#44355b')
plt.xlabel('Number of clusters')
plt.ylabel('Squared Error')
plt.title('Squared Error for Different Numbers of Clusters')
plt.xticks(k_values)
plt.grid(True)

# Find elbow point
diffs = np.diff(squared_errors)
elbow_point = k_values[np.argmin(diffs) + 1]
plt.axvline(x=elbow_point, color='grey', linestyle='--', label=f'Elbow Point ({elbow_point} clusters)')

plt.legend()
plt.show()
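As a side note, the quantity plotted above is also stored directly on the fitted estimator as kmeans.inertia_, so the sign flip can be avoided. A quick check (here kmeans is the last model fitted in the loop above, with k = 10):

```python
# score(X) returns the negative of the K-means objective evaluated on X;
# inertia_ is the same quantity recorded at fit time, so the two agree.
print(-kmeans.score(X))
print(kmeans.inertia_)
```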
Hierarchical Clustering
One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters K.
Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K.
Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.
The hierarchical clustering dendrogram is obtained via an extremely simple algorithm.
Pairwise Dissimilarity: We begin by defining a dissimilarity measure between each pair of observations (here, Euclidean distance); a short sketch computing these pairwise distances is given after the outline below. The algorithm proceeds iteratively.
- Starting out at the bottom of the dendrogram, each of the n observations is treated as its own cluster.
- The two clusters that are most similar to each other are then fused so that there now are n−1 clusters.
- Next the two clusters that are most similar to each other are fused again, so that there now are n − 2 clusters.
- The algorithm proceeds in this fashion until all of the observations belong to one single cluster, and the dendrogram is complete.
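To make the pairwise dissimilarities concrete, here is a minimal sketch using scipy.spatial.distance; scipy's linkage function computes the same condensed distance vector internally when it is given the raw observations.

```python
from scipy.spatial.distance import pdist, squareform

# All pairwise Euclidean distances between the 600 observations,
# first as a condensed vector, then as a full 600 x 600 symmetric matrix.
pairwise = pdist(X, metric='euclidean')
D = squareform(pairwise)
print(D.shape)
print(D[:3, :3].round(2))
```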
We will use the data to plot the hierarchical clustering dendrogram using complete, single, average, and ward linkage, with Euclidean distance as the dissimilarity measure. The linkage defines the dissimilarity between two clusters in terms of the pairwise dissimilarities of their members: complete uses the largest pairwise dissimilarity, single the smallest, average the mean, and ward fuses the pair of clusters whose merge causes the smallest increase in total within-cluster variance.
Code
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

# Define cluster centers and standard deviations
centers = [[4, 2], [1, 7], [5, 6]]
cluster_std = [0.8, 0.5, 0.7]

# Generate synthetic data
X, labels = make_blobs(n_samples=600, centers=centers, cluster_std=cluster_std, random_state=23)

# Define linkage methods
linkage_methods = ['complete', 'average', 'single', 'ward']

# Plot dendrograms
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for i, method in enumerate(linkage_methods):
    row = i // 2
    col = i % 2

    # Compute linkage matrix
    Z = linkage(X, method=method)

    # Plot dendrogram
    dendrogram(Z, ax=axes[row, col], color_threshold=np.inf, above_threshold_color='black')
    axes[row, col].set_title(f'Dendrogram - Method: {method.capitalize()}')

plt.tight_layout()
plt.show()
If we settle on k = 3, we can fit a single average-linkage tree and inspect its dendrogram; cutting the tree into three clusters is shown in the sketch after the plot:
from scipy.cluster.hierarchy import dendrogram, linkage

# Perform hierarchical clustering
linkage_matrix = linkage(X, method='average')  # You can also try other linkage methods

# Plot the dendrogram
plt.figure(figsize=(8, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram (Average)')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
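The dendrogram itself does not commit to a number of clusters. To actually extract k = 3 groups from the fitted tree, we can cut it with scipy's fcluster; a minimal sketch follows (`hier_labels` is just an illustrative name).

```python
from scipy.cluster.hierarchy import fcluster

# Cut the average-linkage tree so that exactly 3 clusters remain.
hier_labels = fcluster(linkage_matrix, t=3, criterion='maxclust')

plt.scatter(X[:, 0], X[:, 1], c=hier_labels, cmap='magma')
plt.title('Average Linkage, k = 3')
plt.show()
```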
Number of Clusters
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Define methods of linkage
linkage_methods = ['complete', 'average', 'single', 'ward']

# Define range of cluster numbers (k)
k_values = range(2, 11)

# Create subplots for each method of linkage
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Compute silhouette coefficient for each method of linkage and each k
for method, ax in zip(linkage_methods, axes.flatten()):
    silhouette_scores = []
    for k in k_values:
        # Perform hierarchical clustering
        clustering = AgglomerativeClustering(n_clusters=k, linkage=method)
        labels = clustering.fit_predict(X)

        # Compute silhouette coefficient
        silhouette = silhouette_score(X, labels)
        silhouette_scores.append(silhouette)

    # Plot silhouette scores for the current method of linkage
    ax.plot(k_values, silhouette_scores, color='#44355b', marker='o', label=method.capitalize())
    ax.set_title(f'Silhouette Score for {method.capitalize()} Linkage')
    ax.set_xlabel('Number of Clusters (k)')
    ax.set_ylabel('Silhouette Score')
    ax.set_xticks(k_values)
    ax.grid(True)

    # Find elbow point
    diffs = np.diff(silhouette_scores)
    elbow_point = k_values[np.argmax(diffs)] if np.any(diffs) else k_values[0]

    # Plot vertical grey line at the elbow point
    ax.axvline(x=elbow_point, color='grey', linestyle='--', label='Elbow Point')

    ax.legend()

plt.tight_layout()
plt.show()
Spectral Clustering
When you have complex-shaped data that is hard to cluster, you can use spectral clustering. This method starts from a similarity (affinity) matrix, which records how similar each pair of data points is, and then clusters the leading eigenvectors of the associated graph Laplacian. Think of it as a way to re-embed, and thereby simplify, the data before grouping it.
from sklearn.datasets import make_circles, make_moons
from sklearn.cluster import KMeans

plt.rcParams['font.size'] = '12'

# Generate more complex synthetic data: Circles and Moons
X1, _ = make_circles(factor=0.5, noise=0.05, n_samples=1500)
X2, _ = make_moons(n_samples=1500, noise=0.05)

fig, ax = plt.subplots(1, 2)
fig.set_size_inches(15, 5)

for i, X in enumerate([X1, X2]):
    # Apply KMeans clustering
    kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
    cluster_labels = kmeans.labels_

    # Plot data with magma colors
    ax[i].scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='magma')
    ax[i].set_title('KMeans Clustering')

plt.tight_layout()
plt.show()
It is clear that k-means cannot group the circles and moons effectively, since it only produces convex, centroid-based clusters. It is time to try spectral clustering:
from sklearn.datasets import make_circles, make_moons
from sklearn.cluster import SpectralClustering

plt.rcParams['font.size'] = '12'

# Generate more complex synthetic data: Circles and Moons
X1, _ = make_circles(factor=0.5, noise=0.05, n_samples=1500)
X2, _ = make_moons(n_samples=1500, noise=0.05)

fig, ax = plt.subplots(1, 2)
fig.set_size_inches(15, 5)

for i, X in enumerate([X1, X2]):
    # Apply spectral clustering
    spectral_cluster = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=0)
    cluster_labels = spectral_cluster.fit_predict(X)

    # Plot data with magma colors
    ax[i].scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='magma')
    ax[i].set_title('Spectral Clustering')

plt.tight_layout()
plt.show()
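With affinity='nearest_neighbors', SpectralClustering builds its similarity matrix from a k-nearest-neighbors graph of the data (the neighborhood size is controlled by its n_neighbors parameter). A minimal sketch of what such an affinity matrix looks like for the moons data, using 10 neighbors here:

```python
from sklearn.neighbors import kneighbors_graph

# Sparse 1500 x 1500 connectivity matrix for the moons data: entry (i, j) is 1
# when j is among the 10 nearest neighbors of i. Spectral clustering
# symmetrizes a graph like this and clusters the leading eigenvectors of its
# graph Laplacian.
A = kneighbors_graph(X2, n_neighbors=10, include_self=False)
print(A.shape, '-', A.nnz, 'nonzero entries')
```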