K-Nearest Neighbors (KNN)

Introduction

K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms. It classifies new data points based on the majority class among the k nearest neighbors in the training set. Despite its simplicity, KNN can be surprisingly effective for many real-world problems.

KNN is a "lazy learning" algorithm because it doesn't build an explicit model during training. Instead, it stores all training data and makes predictions by examining the local neighborhood of each query point.

How KNN Works

1. Store Training Data

KNN simply stores all training examples without building an explicit model.

2. Calculate Distances

For a new query point, calculate distances to all training points using metrics like Euclidean or Manhattan distance.

3. Find K Nearest Neighbors

Select the k training points with the smallest distances to the query point.

4. Make Prediction

For classification: majority vote among k neighbors. For regression: average of k neighbors' values.
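These four steps translate directly into a short from-scratch sketch; the function name knn_predict, the toy data, and k = 3 below are illustrative assumptions.

```python
# From-scratch sketch of the four steps (Euclidean distance, majority
# vote); names and data are illustrative.
from collections import Counter
import math

def knn_predict(X_train, y_train, query, k=3):
    # Step 2: distance from the query to every training point.
    distances = [(math.dist(query, x), label)
                 for x, label in zip(X_train, y_train)]
    # Step 3: the k training points with the smallest distances.
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # Step 4: majority vote (for regression, average the neighbor values).
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Step 1 is implicit: the "model" is just the stored training data.
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = [0, 0, 1, 1]
print(knn_predict(X_train, y_train, [1.2, 1.9]))  # -> 0
```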

Choosing K

K = 1

  • Very sensitive to noise
  • Complex decision boundaries
  • High variance, low bias
  • Prone to overfitting

Optimal K

  • Usually an odd number (to avoid ties)
  • A common rule of thumb is k ≈ √n, where n is the training set size
  • Found through cross-validation (see the sketch after this section)
  • Balances bias and variance

Large K

  • Smooth decision boundaries
  • Less sensitive to noise
  • Low variance, higher bias
  • Risk of underfitting
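
A rough sketch of the cross-validation approach mentioned above: scan odd values of k and keep the one with the best mean score. The synthetic dataset, the candidate range, and cv=5 are assumptions made for the example.

```python
# Choosing k by 5-fold cross-validation (synthetic data and the
# candidate range of k values are assumptions).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Odd values of k avoid ties in binary classification.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in range(1, 22, 2)}

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, mean accuracy = {scores[best_k]:.3f}")
```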

Distance Metrics

Metric      Formula                  Use Case                                       Properties
Euclidean   √Σ(xᵢ - yᵢ)²             Continuous features, geometric data            Straight-line distance, sensitive to scale
Manhattan   Σ|xᵢ - yᵢ|               High-dimensional data, city-block distances    Less sensitive to outliers
Cosine      1 - (A·B)/(‖A‖ ‖B‖)      Text classification, sparse data               Measures angle, not magnitude
Hamming     Σ(xᵢ ≠ yᵢ)               Categorical data, binary strings               Counts differing positions
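
For reference, the four metrics in the table can be written in a few lines of NumPy; the example vectors below are assumptions.

```python
# Illustrative implementations of the metrics in the table above.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))   # straight-line distance

def manhattan(x, y):
    return np.sum(np.abs(x - y))           # city-block distance

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming(x, y):
    return np.sum(x != y)                  # count of differing positions

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean(x, y), manhattan(x, y), cosine_distance(x, y))
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))  # -> 2
```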

When to Use KNN

✓ Good For:

  • Non-linear decision boundaries
  • Local patterns in data
  • Small to medium datasets
  • Multi-class classification
  • Recommendation systems
  • Anomaly detection
  • When the data distribution is unknown

✗ Avoid When:

  • High-dimensional data (curse of dimensionality)
  • Large datasets (slow prediction)
  • Real-time applications
  • Imbalanced datasets
  • Features with very different scales
  • Sparse data
  • When interpretable models are needed

Pros and Cons

Pros                                                   Cons
Simple to understand and implement                     Computationally expensive at prediction time
No assumptions about data distribution                 Sensitive to irrelevant features
Works well with small datasets                         Suffers from the curse of dimensionality
Can be used for both classification and regression     Sensitive to the local structure of the data
Adapts to new data automatically                       Memory intensive (stores all training data)

Key Takeaways

  • KNN is a simple, non-parametric algorithm that makes no assumptions about data distribution
  • The choice of k is crucial: small k leads to overfitting, large k leads to underfitting
  • Feature scaling is essential since KNN relies on distance calculations (see the sketch after this list)
  • Distance metric selection depends on your data type and problem domain
  • KNN is computationally expensive at prediction time but requires no training
  • Works well for local patterns but struggles with high-dimensional data
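
To illustrate the feature-scaling takeaway, the sketch below compares KNN with and without standardization on scikit-learn's wine dataset; the dataset, k = 5, and the pipeline setup are assumptions made for the example.

```python
# Why scaling matters for KNN: compare cross-validated accuracy with and
# without standardization (dataset and k=5 are illustrative choices).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("unscaled:", cross_val_score(raw, X, y, cv=5).mean())
print("scaled:  ", cross_val_score(scaled, X, y, cv=5).mean())
```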