K-Means Clustering
Introduction
K-Means is one of the most popular unsupervised learning algorithms for clustering data. It partitions data into k clusters by grouping similar data points together and separating dissimilar ones. The algorithm aims to minimize the within-cluster sum of squares (WCSS).
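In symbols, with clusters C_1, …, C_k and centroids μ_1, …, μ_k (where μ_i is the mean of cluster C_i), the objective is the standard within-cluster sum of squares:

```latex
\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```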
K-Means assumes that clusters are spherical and roughly equal in size, making it particularly effective for globular cluster shapes but less suitable for elongated or irregular clusters.
How K-Means Works
1. Initialize Centroids
Randomly place k centroids in the data space, or use K-Means++ for smarter initialization.
2. Assign Points to Clusters
Assign each data point to the nearest centroid based on Euclidean distance.
3. Update Centroids
Move each centroid to the center (mean) of all points assigned to its cluster.
4. Repeat Until Convergence
Repeat steps 2 and 3 until the centroids stop moving significantly or the maximum number of iterations is reached. A minimal implementation is sketched below.
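To make the loop concrete, here is a minimal NumPy sketch of the four steps, using plain random initialization for simplicity (the function and parameter names are illustrative, not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=None):
    """Minimal K-Means sketch: random init, Euclidean assignment, mean updates."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels

# Usage on synthetic blobs (illustrative data):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3, seed=0)
```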
Interactive K-Means Clustering
(Interactive demo: experiment with different values of k and initialization methods. Click to add new data points; colored circles are data points, large circles marked with × are centroids, lines connect each point to its cluster centroid, and colors mark the clusters.)
Choosing the Right K
Elbow Method
Plot WCSS against the number of clusters and look for the "elbow": the point where the rate of decrease changes sharply. In practice:
- Run K-Means for k = 1 to k_max
- Calculate WCSS for each k
- Find the point of maximum curvature
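As a sketch with scikit-learn, whose KMeans exposes the WCSS as the inertia_ attribute (the blob data here is just for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in ks]

# The curve drops steeply up to the true k, then flattens: the "elbow"
plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()
```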
Silhouette Analysis
Measures how similar points are to their own cluster compared to other clusters.
- Silhouette scores range from -1 to 1
- Higher values indicate better clustering
- Choose the k with the highest average silhouette score
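A sketch of the sweep using scikit-learn's silhouette_score, on the same kind of illustrative blob data as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# The silhouette is undefined for a single cluster, so start the sweep at k = 2
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: average silhouette = {silhouette_score(X, labels):.3f}")
```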
Initialization Methods
| Method | Description | Pros | Cons |
|---|---|---|---|
| Random | Place centroids randomly in the data space | Simple, fast | Can yield poor clusterings; sensitive to initialization |
| K-Means++ | Choose centroids to be far from each other | Better starting points, faster convergence | Slightly more complex |
| Manual | Use domain knowledge to place centroids | Leverages prior knowledge | Requires domain expertise |
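The K-Means++ idea fits in a few lines: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal sketch, not a library API (scikit-learn's KMeans uses init="k-means++" by default):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Minimal K-Means++ seeding sketch (illustrative helper, not a library API)."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # D^2 weighting: far-away points are more likely to be picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```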
When to Use K-Means
✓ Good For:
- Spherical, well-separated clusters
- Similar cluster sizes
- Large datasets (computationally efficient)
- Market segmentation
- Image segmentation
- Data compression
- Preprocessing for other algorithms
✗ Avoid When:
- Non-spherical cluster shapes (see the sketch after this list)
- Clusters of very different sizes
- Clusters of different densities
- Noisy data with many outliers
- High-dimensional data
- Unknown number of clusters
- Hierarchical cluster structure
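A quick way to see the non-spherical failure mode is a two-moons dataset: K-Means cuts the crescents with a straight boundary, while a density-based method such as DBSCAN follows their shape. A sketch (the eps value is dataset-dependent):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.3).fit_predict(X)

# K-Means splits each crescent roughly in half; DBSCAN recovers both shapes
print("K-Means cluster sizes:", np.bincount(km_labels))
print("DBSCAN cluster sizes: ", np.bincount(db_labels[db_labels >= 0]))
```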
Pros and Cons
| Pros | Cons |
|---|---|
| Computationally efficient: O(nkt) for n points, k clusters, t iterations | Need to specify k in advance |
| Simple to understand and implement | Sensitive to initialization |
| Works well with spherical clusters | Struggles with non-spherical shapes |
| Scales well to large datasets | Sensitive to outliers |
| Guaranteed convergence | May converge to a local minimum |
Key Takeaways
- K-Means partitions data into k spherical clusters by minimizing within-cluster variance
- The algorithm iteratively assigns points to nearest centroids and updates centroid positions
- Choosing the right k is crucial; use the elbow method or silhouette analysis
- K-Means++ initialization significantly improves results compared to random initialization
- Works best with spherical, similar-sized clusters and is sensitive to outliers
- Computationally efficient and scales well to large datasets