L1/L2 Regularization
Regularization helps prevent overfitting by adding a penalty term to the loss function that discourages large weights. L1 regularization promotes sparsity (many weights become exactly zero), while L2 regularization promotes small but non-zero weights.
Mathematical Formulation
L1 Regularization (Lasso)
Loss = Original Loss + λ × Σ|wᵢ|
Gradient: ∂Loss/∂w = ∂(Original Loss)/∂w + λ × sign(w)
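A minimal NumPy sketch of this penalty and its (sub)gradient, assuming a weight vector `w`, a regularization strength `lam`, and a precomputed gradient of the original loss `grad_loss` (all illustrative names, not from the text above):

```python
import numpy as np

def l1_penalty(w, lam):
    """L1 penalty term: λ × Σ|wᵢ|."""
    return lam * np.sum(np.abs(w))

def l1_gradient(grad_loss, w, lam):
    """Subgradient of the penalized loss: ∂(Original Loss)/∂w + λ × sign(w).
    np.sign(0) is 0, a common subgradient choice at the non-differentiable point."""
    return grad_loss + lam * np.sign(w)
```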
L2 Regularization (Ridge)
Loss = Original Loss + λ × Σwᵢ²
Gradient: ∂Loss/∂w = ∂(Original Loss)/∂w + 2λw
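The L2 analogue, under the same assumptions (`w`, `lam`, and `grad_loss` are illustrative names):

```python
import numpy as np

def l2_penalty(w, lam):
    """L2 penalty term: λ × Σwᵢ²."""
    return lam * np.sum(w ** 2)

def l2_gradient(grad_loss, w, lam):
    """Gradient of the penalized loss: ∂(Original Loss)/∂w + 2λw."""
    return grad_loss + 2.0 * lam * w
```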
In both penalties, λ (lambda) controls the regularization strength: higher values mean stronger regularization.
When to Use Each Type
| Aspect | L1 Regularization | L2 Regularization |
|---|---|---|
| Feature selection | Yes, drives some weights to exactly zero | No, shrinks weights but keeps them non-zero |
| Sparsity | Produces sparse models | Keeps all features |
| Computational efficiency | Sparse solutions speed up inference, but the non-smooth penalty makes optimization harder | Smooth penalty with a simple gradient; ridge regression even has a closed-form solution |
| Interpretability | Better, fewer active features | Harder, all features contribute |
| Penalty on large weights | Grows linearly, so a few large weights are tolerated | Grows quadratically, so large weights are strongly discouraged |
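The feature-selection row can be checked empirically. A short sketch, assuming scikit-learn is available and using an arbitrary regularization strength (`alpha` is scikit-learn's name for λ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of them actually informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # usually many exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```

With data like this, Lasso typically zeroes out most of the uninformative coefficients, while Ridge leaves all of them non-zero but small.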
Pros and Cons
L1 Regularization
Pros:
- Automatic feature selection
- Creates interpretable models
- Good for high-dimensional data
- Reduces model complexity
Cons:
- Can be unstable with correlated features (tends to keep one of a correlated group and drop the rest)
- Non-differentiable at zero, so optimizers rely on subgradients or proximal updates (see the sketch after this list)
- May discard useful features
- Less smooth optimization
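One standard workaround for the non-differentiability, shown here only as a sketch (plain NumPy, illustrative names): proximal gradient descent (ISTA) takes an ordinary gradient step on the original loss and then applies soft-thresholding, which is what drives weights exactly to zero.

```python
import numpy as np

def soft_threshold(w, threshold):
    """Proximal operator of the L1 penalty: shrinks each weight toward zero
    and sets it exactly to zero once its magnitude falls below the threshold."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def proximal_gradient_step(w, grad_loss, lr, lam):
    """One ISTA step: gradient step on the original loss only,
    followed by soft-thresholding with threshold lr × λ."""
    return soft_threshold(w - lr * grad_loss, lr * lam)
```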
L2 Regularization
Pros:
- Smooth and differentiable everywhere
- Stable with correlated features
- Often gives better predictive accuracy when most features are informative
- Computationally efficient
Cons:
- No automatic feature selection; weights shrink but never reach exactly zero
- All features remain in the model
- Less interpretable when there are many features
- Irrelevant features still receive non-zero weights and can add noise
Practical Tips
- Start with L2 regularization as it's generally more stable
- Use L1 when you suspect many features are irrelevant
- Try Elastic Net (combines L1 and L2) for the best of both worlds
- Always use cross-validation to tune the regularization parameter λ (see the sketch after this list)
- Standardize features before applying regularization
- Monitor both training and validation loss to detect overfitting
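A minimal sketch combining the standardization and cross-validation tips, assuming scikit-learn and an arbitrary grid of λ values (exposed as `alpha`):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)

# Standardize first so the penalty treats every feature on the same scale.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10000)),
])

# Cross-validate over a grid of regularization strengths (λ, called alpha here).
search = GridSearchCV(pipeline, {"model__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("Best λ (alpha):", search.best_params_["model__alpha"])
```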