Loss Functions & Optimization

Introduction

Loss functions measure how wrong our predictions are, while optimization algorithms help us minimize this loss. Together, they form the core of how machine learning models learn.

Interactive Loss Function Comparison

Explore different loss functions and see how they penalize prediction errors differently:

(Interactive widget: choose a loss function and move the prediction; the panel reports the current loss and gradient, for example loss 0.1600 with gradient -0.8000, and a red arrow shows the direction to move the prediction to reduce the loss.)
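
Those values are consistent with squared-error loss at a residual of 0.4: (0.4)² = 0.16 for the loss and -2 × 0.4 = -0.8 for the gradient with respect to the prediction. Below is a minimal sketch of that computation; the target y = 1.0 and prediction ŷ = 0.6 are assumed values chosen to match the display.

    # Squared-error loss and its gradient at a single point.
    # y = 1.0 and y_hat = 0.6 are assumed values chosen to reproduce
    # the display above (loss 0.1600, gradient -0.8000).
    def mse_loss(y, y_hat):
        return (y - y_hat) ** 2

    def mse_grad(y, y_hat):
        # d/d(y_hat) of (y - y_hat)^2
        return -2.0 * (y - y_hat)

    y, y_hat = 1.0, 0.6
    print(f"Loss: {mse_loss(y, y_hat):.4f}")      # Loss: 0.1600
    print(f"Gradient: {mse_grad(y, y_hat):.4f}")  # Gradient: -0.8000

The negative gradient tells us the prediction should move up, toward the target: exactly what the red arrow indicates.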

Loss Landscape & Gradient Descent

Visualize how gradient descent navigates a 2D loss landscape to find the minimum:

(Interactive widget: a contour plot of the loss landscape for the Rosenbrock function, shaded from low to high loss, with optimization controls. The descent starts at w₁ = -1.500, w₂ = 1.500, where the loss is 9.0625.)

Tips:

  • Small learning rates: slow but stable
  • Large learning rates: fast but may overshoot
  • The path traces how gradient descent navigates the loss surface
  • Each step follows the negative gradient, the direction of steepest descent (a minimal code sketch of this loop follows below)
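
The demo's surface is consistent with a Rosenbrock-style loss f(w₁, w₂) = (1 - w₁)² + b(w₂ - w₁²)²; b = 5 is an assumed scale chosen because it reproduces the displayed starting loss of 9.0625 at (-1.5, 1.5). A minimal sketch of the descent loop, not the demo's exact implementation:

    # Gradient descent on a Rosenbrock-style loss.
    # f(w1, w2) = (1 - w1)^2 + b * (w2 - w1^2)^2
    # b = 5 and the start (-1.5, 1.5) are assumptions chosen to match the
    # displayed starting loss of 9.0625; the true minimum is at (1, 1).
    def loss(w1, w2, b=5.0):
        return (1 - w1) ** 2 + b * (w2 - w1 ** 2) ** 2

    def grad(w1, w2, b=5.0):
        dw1 = -2 * (1 - w1) - 4 * b * w1 * (w2 - w1 ** 2)
        dw2 = 2 * b * (w2 - w1 ** 2)
        return dw1, dw2

    w1, w2 = -1.5, 1.5   # starting position from the demo
    lr = 0.005           # small learning rate: slow but stable
    for step in range(5001):
        if step % 1000 == 0:
            print(f"step {step:4d}: w1={w1:+.3f} w2={w2:+.3f} "
                  f"loss={loss(w1, w2):.4f}")
        dw1, dw2 = grad(w1, w2)
        w1 -= lr * dw1   # step against the gradient
        w2 -= lr * dw2   # (negative gradient = steepest descent)

Raising lr far above this value makes the iterates overshoot the curved valley and diverge, which is the instability the tips above warn about.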

Common Loss Functions

Regression Losses

  • MSE: Standard for regression. Squares errors, so large errors dominate and outliers pull the fit.
  • MAE: Robust to outliers, but its gradient is piecewise constant and non-smooth at zero.
  • Huber: Quadratic for small errors, linear for large ones: smooth like MSE near the minimum, robust like MAE to outliers (all three compared in the sketch below).
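
As a comparison, here is a minimal sketch of all three on the same residuals. NumPy is assumed for vectorized math; δ = 1.0 is a common Huber threshold, and the 0.5 factor is the standard convention that makes the two branches meet smoothly.

    import numpy as np

    def mse(residual):
        return residual ** 2

    def mae(residual):
        return np.abs(residual)

    def huber(residual, delta=1.0):
        # Quadratic within |r| <= delta, linear beyond it.
        small = np.abs(residual) <= delta
        return np.where(small,
                        0.5 * residual ** 2,
                        delta * (np.abs(residual) - 0.5 * delta))

    residuals = np.array([0.1, 0.5, 1.0, 5.0])  # 5.0 plays the outlier
    print("MSE:  ", mse(residuals))    # outlier dominates: 25.0
    print("MAE:  ", mae(residuals))    # grows linearly: 5.0
    print("Huber:", huber(residuals))  # linear past delta: 4.5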

Classification Losses

  • Cross-Entropy: Standard for classification. Operates on predicted probabilities.
  • Hinge Loss: Used in SVMs. Encourages a margin between classes.
  • Focal Loss: Down-weights easy examples to counter class imbalance (all three sketched below).
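
A minimal sketch of the three in their binary forms. The hinge loss assumes labels in {-1, +1} and a raw score rather than a probability; γ = 2 and α = 0.25 are the focal-loss defaults from Lin et al.

    import numpy as np

    def binary_cross_entropy(p, y):
        # y in {0, 1}, p = predicted probability of class 1.
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    def hinge(score, y):
        # y in {-1, +1}, score = raw model output; zero loss past the margin.
        return np.maximum(0.0, 1 - y * score)

    def focal(p, y, gamma=2.0, alpha=0.25):
        # Down-weights well-classified examples by (1 - p_t)^gamma.
        p_t = np.where(y == 1, p, 1 - p)
        alpha_t = np.where(y == 1, alpha, 1 - alpha)
        return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

    p = np.array([0.9, 0.6, 0.1])   # predicted P(class 1)
    y = np.array([1, 1, 1])         # all true positives
    print("Cross-entropy:", binary_cross_entropy(p, y))
    print("Focal:        ", focal(p, y))  # easy example (p=0.9) shrinks most

    scores = np.array([2.0, 0.5, -1.0])
    print("Hinge:        ", hinge(scores, y))  # zero once margin is cleared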

Interactive Loss Function Explorer

(Interactive explorer: choose a loss function to see its formula, description, and typical uses. The Mean Squared Error entry is reproduced below.)

Mean Squared Error (MSE)

Formula:
L(y, ŷ) = (y - ŷ)²
Description:

Squares the difference between predicted and actual values. Heavily penalizes large errors, making it sensitive to outliers. Most common for regression tasks.

When to use:

Linear regression and neural networks that predict continuous values.
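
The gradient that drives learning follows by the chain rule: differentiating with respect to the prediction gives

∂L/∂ŷ = -2(y - ŷ)

so when the prediction is below the target (y - ŷ > 0) the gradient is negative, and the update ŷ ← ŷ - η ∂L/∂ŷ moves the prediction up toward the target (η is the learning rate).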

Key Takeaways

  • Loss functions quantify how wrong our predictions are
  • Different loss functions are suited for different problems
  • Gradient descent steps along the negative gradient, the direction of steepest descent
  • Learning rate controls the step size in optimization
  • The loss landscape can have multiple local minima
  • Understanding loss functions and optimization is crucial for debugging ML models