K-Means Clustering
Introduction
K-Means is one of the most popular unsupervised learning algorithms for clustering data. It partitions data into k clusters by grouping similar data points together and separating dissimilar ones. The algorithm aims to minimize the within-cluster sum of squares (WCSS).
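In symbols, with clusters C_1, …, C_k and centroids μ_1, …, μ_k (where μ_i is the mean of cluster C_i), the objective is the standard within-cluster sum of squares:

```latex
\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```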
K-Means assumes that clusters are spherical and roughly equal in size, making it particularly effective for globular cluster shapes but less suitable for elongated or irregular clusters.
How K-Means Works
1. Initialize Centroids
Randomly place k centroids in the data space, or use K-Means++ for smarter initialization.
2. Assign Points to Clusters
Assign each data point to the nearest centroid based on Euclidean distance.
3. Update Centroids
Move each centroid to the center (mean) of all points assigned to its cluster.
4. Repeat Until Convergence
Repeat steps 2 and 3 until the centroids stop moving significantly or the maximum number of iterations is reached. A minimal implementation is sketched below.
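To make the loop concrete, here is a minimal NumPy sketch of the four steps, using plain random initialization for simplicity (the function and parameter names are illustrative, not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=None):
    """Minimal K-Means sketch: random init, Euclidean assignment, mean updates."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels

# Usage on synthetic blobs (illustrative data):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3, seed=0)
```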
Interactive K-Means Clustering
(Interactive demo: experiment with different values of k and initialization methods. Click to add new data points; colored circles are data points, large circles marked with × are centroids, lines connect each point to its cluster centroid, and colors mark the clusters.)
Choosing the Right K
Elbow Method
Plot WCSS against the number of clusters and look for the "elbow": the point where the rate of decrease changes sharply. In practice:
- Run K-Means for k = 1 to k_max
- Calculate WCSS for each k
- Find the point of maximum curvature
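As a sketch with scikit-learn, whose KMeans exposes the WCSS as the inertia_ attribute (the blob data here is just for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in ks]

# The curve drops steeply up to the true k, then flattens: the "elbow"
plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()
```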
Silhouette Analysis
Measures how similar points are to their own cluster compared to other clusters.
- Silhouette scores range from -1 to 1
- Higher values indicate better clustering
- Choose the k with the highest average silhouette score
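A sketch of the sweep using scikit-learn's silhouette_score, on the same kind of illustrative blob data as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# The silhouette is undefined for a single cluster, so start the sweep at k = 2
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: average silhouette = {silhouette_score(X, labels):.3f}")
```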
Initialization Methods
| Method | Description | Pros | Cons |
|---|---|---|---|
| Random | Place centroids randomly in the data space | Simple, fast | Can yield poor clusterings; sensitive to initialization |
| K-Means++ | Choose centroids to be far from each other | Better starting points, faster convergence | Slightly more complex |
| Manual | Use domain knowledge to place centroids | Leverages prior knowledge | Requires domain expertise |
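The K-Means++ idea fits in a few lines: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal sketch, not a library API (scikit-learn's KMeans uses init="k-means++" by default):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Minimal K-Means++ seeding sketch (illustrative helper, not a library API)."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # D^2 weighting: far-away points are more likely to be picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```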
When to Use K-Means
✓ Good For:
- Spherical, well-separated clusters
- Similar cluster sizes
- Large datasets (computationally efficient)
- Market segmentation
- Image segmentation
- Data compression
- Preprocessing for other algorithms
✗ Avoid When:
- Non-spherical cluster shapes (see the sketch after this list)
- Clusters of very different sizes
- Clusters of different densities
- Noisy data with many outliers
- High-dimensional data
- Unknown number of clusters
- Hierarchical cluster structure
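A quick way to see the non-spherical failure mode is a two-moons dataset: K-Means cuts the crescents with a straight boundary, while a density-based method such as DBSCAN follows their shape. A sketch (the eps value is dataset-dependent):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.3).fit_predict(X)

# K-Means splits each crescent roughly in half; DBSCAN recovers both shapes
print("K-Means cluster sizes:", np.bincount(km_labels))
print("DBSCAN cluster sizes: ", np.bincount(db_labels[db_labels >= 0]))
```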
Pros and Cons
| Pros | Cons |
|---|---|
| Computationally efficient: O(nkt) for n points, k clusters, t iterations | Need to specify k in advance |
| Simple to understand and implement | Sensitive to initialization |
| Works well with spherical clusters | Struggles with non-spherical shapes |
| Scales well to large datasets | Sensitive to outliers |
| Guaranteed convergence | May converge to a local minimum |
Key Takeaways
- K-Means partitions data into k spherical clusters by minimizing within-cluster variance
- The algorithm iteratively assigns points to nearest centroids and updates centroid positions
- Choosing the right k is crucial; use the elbow method or silhouette analysis
- K-Means++ initialization significantly improves results compared to random initialization
- Works best with spherical, similar-sized clusters and is sensitive to outliers
- Computationally efficient and scales well to large datasets