K-Means Clustering

Introduction

K-Means is one of the most popular unsupervised learning algorithms for clustering data. It partitions data into k clusters by grouping similar data points together and separating dissimilar ones. The algorithm aims to minimize the within-cluster sum of squares (WCSS).
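Concretely, writing C_j for cluster j and μ_j for its centroid (the mean of its points), the objective K-Means minimizes is:

```latex
\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```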

K-Means assumes that clusters are spherical and roughly equal in size, making it particularly effective for globular cluster shapes but less suitable for elongated or irregular clusters.
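To see that limitation concretely, here is a small sketch using scikit-learn's make_moons toy dataset (the dataset choice and the agreement check are illustrative assumptions, not part of the original text): K-Means splits two interleaved half-moons with a roughly straight boundary instead of following their shapes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: clearly non-spherical clusters.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moons, up to swapping the two labels;
# expect it to be well below 1.0 because the boundary is straight.
agreement = max((labels == y_true).mean(), (labels != y_true).mean())
print(f"best label agreement: {agreement:.2f}")
```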

How K-Means Works

1. Initialize Centroids

Randomly place k centroids in the data space, or use K-Means++ for smarter initialization.

2. Assign Points to Clusters

Assign each data point to the nearest centroid based on Euclidean distance.

3. Update Centroids

Move each centroid to the center (mean) of all points assigned to its cluster.

4. Repeat Until Convergence

Continue steps 2-3 until the centroids stop moving significantly or the maximum number of iterations is reached.
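Putting the four steps together, here is a minimal NumPy sketch of the loop; the function name, the tol threshold, and the empty-cluster handling are illustrative choices, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    """Minimal K-Means: random init, Euclidean assignment, mean update,
    and a convergence check on total centroid movement."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k distinct points drawn from X.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points,
        # keeping the old position if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss
```

In practice you would run this several times with different seeds and keep the lowest-WCSS solution, since each run may settle into a different local minimum.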

Interactive K-Means Clustering

Experiment with different values of k and initialization methods:

[Interactive demo: click to add new data points; lines connect each point to its cluster centroid. A statistics panel tracks the number of data points, iterations run, convergence status, and the current WCSS.]

Visual Guide:

  • ● Colored circles: Data points
  • ⊗ Large circles with ×: Centroids
  • — Lines: Point-to-centroid connections
  • 🎨 Colors: Different clusters

Choosing the Right K

Elbow Method

Plot WCSS vs. the number of clusters and look for the "elbow" point where the rate of decrease sharply changes; a code sketch follows the list below.

  • Run K-Means for k = 1 to k_max
  • Calculate WCSS for each k
  • Find the point of maximum curvature
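A minimal elbow-method sketch with scikit-learn, assuming a synthetic three-blob dataset for illustration; note that scikit-learn's inertia_ attribute is exactly the WCSS:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed toy data: three Gaussian blobs, so the true k is 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])

for k in range(1, 9):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}  WCSS={wcss:.1f}")
# Plot k vs. WCSS and pick the k at the elbow (here, around k = 3).
```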

Silhouette Analysis

Measures how similar each point is to its own cluster compared to other clusters; a code sketch follows the list below.

  • Silhouette score ranges from -1 to 1
  • Higher values indicate better clustering
  • Choose k with the highest average silhouette score
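A matching silhouette sketch, again on assumed three-blob toy data, using scikit-learn's silhouette_score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same assumed three-blob toy data as in the elbow sketch.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])

# The silhouette score is only defined for 2 or more clusters.
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}")
# Choose the k with the highest average score (expect k = 3 here).
```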

Initialization Methods

Method | Description | Pros | Cons
Random | Place centroids randomly in the data space | Simple, fast | Poor clustering, sensitive to initialization
K-Means++ | Choose centroids to be far from each other | Better initialization, faster convergence | Slightly more complex
Manual | Use domain knowledge to place centroids | Leverages prior knowledge | Requires domain expertise
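In scikit-learn the first two methods map directly to the init parameter; the sketch below compares them on assumed synthetic blobs, using n_init=1 so each run keeps a single initialization and the sensitivity is visible:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])

# With n_init=1, a bad random start is not rescued by restarts,
# so the worst-case WCSS exposes the value of K-Means++.
for init in ("random", "k-means++"):
    wcss = [KMeans(n_clusters=3, init=init, n_init=1, random_state=s)
            .fit(X).inertia_ for s in range(10)]
    print(f"{init:10s} worst WCSS over 10 seeds: {max(wcss):.1f}")
```

Manual placement corresponds to passing an explicit array of starting centroids as init.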

When to Use K-Means

✓ Good For:

  • Spherical, well-separated clusters
  • Similar cluster sizes
  • Large datasets (computationally efficient)
  • Market segmentation
  • Image segmentation
  • Data compression
  • Preprocessing for other algorithms

✗ Avoid When:

  • Non-spherical cluster shapes
  • Clusters of very different sizes
  • Clusters of different densities
  • Noisy data with many outliers
  • High-dimensional data
  • Unknown number of clusters
  • Hierarchical cluster structure

Pros and Cons

Pros | Cons
Computationally efficient, O(nkt) for n points, k clusters, t iterations | Need to specify k in advance
Simple to understand and implement | Sensitive to initialization
Works well with spherical clusters | Struggles with non-spherical shapes
Scales well to large datasets | Sensitive to outliers
Guaranteed convergence | May converge to a local minimum

Key Takeaways

  • K-Means partitions data into k clusters by minimizing within-cluster variance, implicitly assuming roughly spherical clusters
  • The algorithm iteratively assigns points to nearest centroids and updates centroid positions
  • Choosing the right k is crucial; use the elbow method or silhouette analysis
  • K-Means++ initialization significantly improves results compared to random initialization
  • Works best with spherical, similar-sized clusters and is sensitive to outliers
  • Computationally efficient and scales well to large datasets