Agglomerative Clustering
Definition
Agglomerative Clustering is a type of hierarchical clustering method where data points are grouped together based on similarity. Initially, each data point is treated as an individual cluster, and then in successive iterations, the closest clusters are merged until only one large cluster remains or until a specified stopping criterion is met.
How it works
Agglomerative clustering starts with each data point as its own cluster. The algorithm then iterates: it identifies the two clusters that are closest to each other according to a chosen distance metric (e.g., Euclidean, Manhattan) and merges them into a single cluster. This repeats until all data points belong to one cluster or another stopping criterion is met. A dendrogram, a tree-like diagram, can be used to represent the sequence of merges, providing a visual representation of the hierarchical structure of the data.
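The merge loop above can be sketched in plain Python. This is a toy illustration using single linkage (the distance between two clusters is the distance between their closest points) and Euclidean distance, not an optimized implementation; the data points are invented for the example:

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(points, n_clusters=1):
    # Start with each data point as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage
        # distance (closest pair of points across the two clusters).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                euclidean(a, b)
                for a in clusters[ij[0]]
                for b in clusters[ij[1]]
            ),
        )
        # Merge cluster j into cluster i.
        clusters[i].extend(clusters.pop(j))
    return clusters

print(agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], n_clusters=2))
# → [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```

Scanning every pair of clusters at every step is what makes the naive algorithm expensive; library implementations use smarter data structures.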
Considerations
Choice of Distance Metric: The outcome can vary significantly depending on the chosen distance metric (e.g., Euclidean, Manhattan).
Scalability: Agglomerative clustering can be computationally intensive for large datasets; the standard algorithm needs O(n²) memory for the pairwise distance matrix and O(n³) time in the naive case.
Sensitivity: The method can be sensitive to outliers, which might affect the quality of the clusters formed.
Key Test Considerations
Unsupervised Learning:
- Number of Clusters: Determine an optimal number of clusters using the dendrogram and techniques like the elbow method.
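One common way to pick a cluster count is to build the full merge hierarchy and then cut the tree. A minimal sketch with SciPy (assuming scipy and numpy are available; the two-blob data is invented for the example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(5, 2)),
               rng.normal(5.0, 0.1, size=(5, 2))])

Z = linkage(X, method="average")                 # full merge hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```

Plotting `Z` with `scipy.cluster.hierarchy.dendrogram` shows the merge heights, and a large jump in height suggests a natural place to cut.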
Cluster Analysis:
Silhouette Score: Evaluates how similar an object is to its own cluster compared to other clusters. Scores range from -1 to 1; a higher silhouette score indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
Dunn Index: The ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance; higher values indicate more compact, well-separated clusters.
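Both metrics can be computed as follows. The silhouette score comes from scikit-learn; the Dunn index has no standard scikit-learn implementation, so a small helper is sketched here under that assumption (the six data points are invented for the example):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Silhouette: in [-1, 1], higher is better.
print(round(silhouette_score(X, labels), 3))

def dunn_index(X, labels):
    # Smallest inter-cluster distance divided by the largest
    # intra-cluster distance; higher is better.
    groups = [X[labels == k] for k in np.unique(labels)]
    inter = min(cdist(a, b).min()
                for i, a in enumerate(groups) for b in groups[i + 1:])
    intra = max(cdist(g, g).max() for g in groups)
    return inter / intra

print(round(dunn_index(X, labels), 3))
```

For these well-separated blobs both metrics come out high; on real data they are most useful for comparing alternative clusterings of the same dataset.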
Hierarchical Clustering:
- Cophenetic Correlation Coefficient: Measures the correlation between the distances of points in feature space and their distances on the dendrogram. Helps assess the fidelity of the dendrogram in preserving pairwise distances between samples.
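SciPy exposes this measure directly via `scipy.cluster.hierarchy.cophenet`, which correlates the original pairwise distances with the dendrogram (cophenetic) distances. A small sketch, with invented two-blob data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, size=(6, 2)),
               rng.normal(4.0, 0.2, size=(6, 2))])

Z = linkage(X, method="average")
# cophenet returns the correlation coefficient and the cophenetic distances.
c, coph_dists = cophenet(Z, pdist(X))
print(round(c, 3))  # values near 1 mean the dendrogram preserves distances well
```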
Agglomerative Clustering:
- Linkage Criteria: Test different linkage criteria (e.g., single, complete, average) to determine which produces the most cohesive clusters for the data at hand.
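Such a comparison can be scripted by sweeping the `linkage` parameter of scikit-learn's `AgglomerativeClustering` and scoring each result, here with the silhouette score on invented example data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.array([[0, 0], [0.5, 0], [0, 0.5],
              [5, 5], [5.5, 5], [5, 5.5]], dtype=float)

# Compare linkage criteria by silhouette score on the resulting labels.
for link in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, round(silhouette_score(X, labels), 3))
```

On real data the criteria can disagree substantially: single linkage tends to chain elongated clusters together, while complete and Ward linkage favor compact ones.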
Platforms, Tools, or Libraries
scikit-learn:
A versatile machine learning library in Python. The AgglomerativeClustering class in scikit-learn provides this functionality.
SciPy:
A Python library used for scientific and technical computing. The scipy.cluster.hierarchy module provides functions for hierarchical and agglomerative clustering, including the linkage and dendrogram functions.
R:
The hclust function in the stats package provides agglomerative clustering. The agnes function in the cluster package offers a more extensive implementation.
MATLAB:
- Offers the linkage function for hierarchical agglomerative clustering and dendrogram for visualization.
Weka:
A collection of machine learning algorithms for data mining tasks. The HierarchicalClusterer class provides an implementation of agglomerative clustering.