K-Nearest Neighbors (KNN)
Introduction
K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms. It classifies new data points based on the majority class among the k nearest neighbors in the training set. Despite its simplicity, KNN can be surprisingly effective for many real-world problems.
KNN is a "lazy learning" algorithm because it doesn't build an explicit model during training. Instead, it stores all training data and makes predictions by examining the local neighborhood of each query point.
How KNN Works
1. Store Training Data
KNN simply stores all training examples without building an explicit model.
2. Calculate Distances
For a new query point, calculate distances to all training points using metrics like Euclidean or Manhattan distance.
3. Find K Nearest Neighbors
Select the k training points with the smallest distances to the query point.
4. Make Prediction
For classification: majority vote among k neighbors. For regression: average of k neighbors' values.
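The four steps above map almost directly onto code. Here is a minimal from-scratch sketch in Python; it assumes NumPy is available, and the function name `knn_predict` and the toy data are purely illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distances from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of those neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: two features, two classes
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [6.0, 5.0], [7.0, 6.5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([6.5, 5.5]), k=3))  # -> 1
```

In practice a library implementation such as scikit-learn's `KNeighborsClassifier` would replace this sketch, but the steps it performs are the same.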
Interactive KNN Classifier
(Interactive demo: a scatter plot of three training classes: red (class 0), blue (class 1), and green (class 2). Click to place a black query point, right-click to add new training points; the k selected neighbors are highlighted and the resulting prediction is shown.)
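For readers of a static copy of this page, the sketch below reproduces the gist of the demo. It assumes scikit-learn and matplotlib are installed; the three-class blob dataset and the query point are made up for illustration. It fits a k = 5 classifier to three clusters, shades the decision regions, and marks a single query point:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Three synthetic clusters standing in for the red/blue/green classes
X, y = make_blobs(n_samples=150, centers=3, random_state=42)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Evaluate the classifier on a grid so the decision regions can be shaded
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")  # training points

query = X.mean(axis=0)  # an arbitrary query point inside the data range
print("predicted class:", clf.predict(query.reshape(1, -1))[0])
plt.scatter(query[0], query[1], c="black", marker="x", s=120)

plt.title("KNN decision regions (k = 5)")
plt.show()
```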
Choosing K
K = 1
- Very sensitive to noise
- Complex decision boundaries
- High variance, low bias
- Prone to overfitting
Optimal K
- Usually an odd number (to avoid tied votes)
- A common heuristic is k ≈ √n, where n is the training set size
- Found through cross-validation (see the sketch after these lists)
- Balances bias and variance
Large K
- Smooth decision boundaries
- Less sensitive to noise
- Low variance, higher bias
- Risk of underfitting
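To make the cross-validation bullet concrete, here is a sketch assuming scikit-learn and its bundled iris dataset: it scores odd values of k with 5-fold cross-validation and reports the best one.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score odd values of k with 5-fold cross-validation and keep the best one
k_values = range(1, 32, 2)
scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(scores))]
print(f"best k = {best_k}, mean accuracy = {max(scores):.3f}")
```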
Distance Metrics
| Metric | Formula | Use Case | Properties |
|---|---|---|---|
| Euclidean | √Σ(xi - yi)² | Continuous features, geometric data | Straight-line distance, sensitive to scale |
| Manhattan | Σ‖xi - yi‖ | High-dimensional data, city-block distances | Less sensitive to outliers |
| Cosine | 1 - (A·B)/(‖A‖ ‖B‖) | Text classification, sparse data | Measures angle, not magnitude |
| Hamming | Σ(xi ≠ yi) | Categorical data, binary strings | Counts differing positions |
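All of these metrics are available off the shelf. The sketch below assumes SciPy and uses small made-up vectors to show how each one is computed (note that SciPy's `hamming` returns the fraction of differing positions rather than the raw count):

```python
from scipy.spatial import distance

a = [1.0, 0.0, 2.0]
b = [2.0, 1.0, 0.0]

print(distance.euclidean(a, b))  # straight-line distance: sqrt(1 + 1 + 4)
print(distance.cityblock(a, b))  # Manhattan distance: 1 + 1 + 2
print(distance.cosine(a, b))     # 1 - cosine similarity (angle, not magnitude)

# SciPy's hamming returns the *fraction* of differing positions;
# multiply by the vector length to recover the count in the table's formula.
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # 2 of 3 positions differ -> 0.666...
```

In scikit-learn, the same choice is made through `KNeighborsClassifier`'s `metric` parameter (for example `metric="manhattan"`).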
When to Use KNN
✓ Good For:
- Non-linear decision boundaries
- Local patterns in data
- Small to medium datasets
- Multi-class classification
- Recommendation systems
- Anomaly detection
- When data distribution is unknown
✗ Avoid When:
- High-dimensional data (curse of dimensionality; see the sketch after this list)
- Large datasets (slow prediction)
- Real-time applications
- Imbalanced datasets
- Features have very different scales
- Sparse data
- Need interpretable models
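One way to see the curse of dimensionality is that, for random data, the nearest and farthest training points become almost equidistant from a query as the number of features grows, so "nearest" stops carrying information. The sketch below assumes NumPy and uses uniformly random points purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# As the number of features grows, the nearest and farthest training points
# end up at nearly the same distance from the query point.
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))   # 500 random "training" points
    q = rng.random(d)          # one random query point
    dists = np.linalg.norm(X - q, axis=1)
    print(f"d = {d:4d}   min/max distance ratio = {dists.min() / dists.max():.2f}")
```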
Pros and Cons
Pros | Cons |
---|---|
Simple to understand and implement | Computationally expensive at prediction time |
No assumptions about data distribution | Sensitive to irrelevant features |
Works well with small datasets | Suffers from curse of dimensionality |
Can be used for both classification and regression | Sensitive to local structure of data |
Adapts to new data automatically | Memory intensive (stores all training data) |
Key Takeaways
- KNN is a simple, non-parametric algorithm that makes no assumptions about data distribution
- The choice of k is crucial: small k leads to overfitting, large k leads to underfitting
- Feature scaling is essential since KNN relies on distance calculations (see the pipeline sketch after this list)
- Distance metric selection depends on your data type and problem domain
- KNN is computationally expensive at prediction time but requires no training
- Works well for local patterns but struggles with high-dimensional data
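As a concrete example of the feature-scaling takeaway, the sketch below compares an unscaled KNN classifier with one wrapped in a standardization pipeline; it assumes scikit-learn and uses its bundled wine dataset, whose features have very different numeric ranges:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Without scaling, features with large numeric ranges dominate the distance
raw = KNeighborsClassifier(n_neighbors=5)

# With standardization, every feature contributes on a comparable scale
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("unscaled accuracy:", cross_val_score(raw, X, y, cv=5).mean().round(3))
print("scaled accuracy:  ", cross_val_score(scaled, X, y, cv=5).mean().round(3))
```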