L1/L2 Regularization

Regularization helps prevent overfitting by adding a penalty term to the loss function that discourages large weights. L1 regularization promotes sparsity (driving many weights to exactly zero), while L2 regularization promotes small but non-zero weights.

Mathematical Formulation

L1 Regularization (Lasso)

Loss = Original Loss + λ × Σ|wᵢ|

Gradient: ∂Loss/∂w = ∂(Original Loss)/∂w + λ × sign(w)   (strictly a subgradient, since |w| is not differentiable at w = 0)
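As a concrete illustration, here is a minimal NumPy sketch of the L1 penalty and its (sub)gradient. The function names, weights, and λ value are illustrative, not from any particular library.

    import numpy as np

    def l1_penalty(weights, lam):
        """L1 penalty term: lam * sum(|w_i|)."""
        return lam * np.sum(np.abs(weights))

    def l1_subgradient(weights, lam):
        """Subgradient of the L1 penalty: lam * sign(w).

        np.sign returns 0 at w = 0, which is one valid choice from the
        subdifferential [-lam, lam] at that point.
        """
        return lam * np.sign(weights)

    w = np.array([0.5, -2.0, 0.0])
    print(l1_penalty(w, lam=0.1))      # 0.1 * (0.5 + 2.0 + 0.0) = 0.25
    print(l1_subgradient(w, lam=0.1))  # [ 0.1 -0.1  0. ]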

L2 Regularization (Ridge)

Loss = Original Loss + λ × Σwᵢ²

Gradient: ∂Loss/∂w = ∂(Original Loss)/∂w + 2λw
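The L2 counterpart in the same sketch style (names again illustrative). The comment spells out why the resulting gradient-descent update is often called "weight decay": each step shrinks every weight multiplicatively but never sets one exactly to zero.

    import numpy as np

    def l2_penalty(weights, lam):
        """L2 penalty term: lam * sum(w_i^2)."""
        return lam * np.sum(weights ** 2)

    def l2_gradient(weights, lam):
        """Gradient of the L2 penalty: 2 * lam * w (smooth everywhere)."""
        return 2.0 * lam * weights

    # Gradient-descent step with learning rate eta:
    #   w <- w - eta * (grad_loss + 2 * lam * w)
    #      = (1 - 2 * eta * lam) * w - eta * grad_loss   ("weight decay")
    w = np.array([0.5, -2.0, 0.0])
    print(l2_penalty(w, lam=0.1))   # 0.1 * (0.25 + 4.0 + 0.0) = 0.425
    print(l2_gradient(w, lam=0.1))  # [ 0.1 -0.4  0. ]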

λ (Lambda)

Regularization strength - higher values mean stronger regularization (more shrinkage of the weights), while values set too high can cause underfitting
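To see the effect of λ directly, here is a small synthetic-data sketch using scikit-learn's Ridge (scikit-learn names the strength alpha rather than λ). The data, alpha values, and seed are illustrative.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

    for alpha in (0.01, 1.0, 100.0):
        coef = Ridge(alpha=alpha).fit(X, y).coef_
        print(f"alpha={alpha:>6}: coefficients = {np.round(coef, 3)}")
    # Higher alpha shrinks every coefficient further toward zero.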

When to Use Each Type

Aspect                   | L1 Regularization                           | L2 Regularization
------------------------ | ------------------------------------------- | ---------------------------------------------
Feature Selection        | Yes - drives weights to exactly zero        | No - shrinks weights but keeps them non-zero
Sparsity                 | Creates sparse models                       | Keeps all features
Computational Efficiency | More efficient for sparse data              | Generally faster gradient computation
Interpretability         | Better - fewer active features              | Harder - all features contribute
Stability                | Can be unstable with correlated features    | Stable, unique solutions
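The feature-selection row can be checked directly: on synthetic data where most features are irrelevant, scikit-learn's Lasso typically zeroes out the irrelevant coefficients while Ridge merely shrinks them. A hedged sketch (data and alpha are illustrative):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(42)
    X = rng.normal(size=(150, 10))
    true_w = np.zeros(10)
    true_w[:3] = [4.0, -2.0, 1.0]  # only the first 3 features matter
    y = X @ true_w + rng.normal(scale=0.5, size=150)

    lasso = Lasso(alpha=0.5).fit(X, y)
    ridge = Ridge(alpha=0.5).fit(X, y)
    print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically 7
    print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0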

Pros and Cons

L1 Regularization

Pros:

  • Automatic feature selection
  • Creates interpretable models
  • Good for high-dimensional data
  • Reduces model complexity

Cons:

  • Can be unstable with correlated features
  • Non-differentiable at zero (see the sketch after this list)
  • May discard useful features
  • Less smooth optimization
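One standard workaround for the non-differentiability at zero is the soft-thresholding (proximal) update used by ISTA-style solvers: take a gradient step on the smooth part of the loss, then shrink each weight toward zero and clip small ones to exactly zero. A minimal sketch with illustrative names:

    import numpy as np

    def soft_threshold(w, threshold):
        """Proximal operator of the L1 norm: shrink toward zero, clip to zero."""
        return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

    # One ISTA-style step with learning rate eta:
    #   w <- soft_threshold(w - eta * grad_loss(w), eta * lam)
    w = np.array([0.9, -0.05, 0.3])
    print(soft_threshold(w, threshold=0.1))  # [ 0.8 -0.   0.2]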

L2 Regularization

Pros:

  • Smooth and differentiable everywhere
  • Stable with correlated features
  • Better for prediction accuracy
  • Computationally efficient (a closed-form solution exists - see the sketch below)

Cons:

  • Doesn't perform feature selection
  • All features remain in the model
  • Less interpretable
  • Performs poorly when many features are irrelevant, since none are removed
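One concrete reason for the computational efficiency noted above: with the squared penalty, ridge regression has the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy. A minimal NumPy sketch (data and λ are illustrative):

    import numpy as np

    def ridge_closed_form(X, y, lam):
        """Solve (X^T X + lam * I) w = X^T y without forming an explicit inverse."""
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 4))
    y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
    print(np.round(ridge_closed_form(X, y, lam=1.0), 3))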

Practical Tips

  • Start with L2 regularization, as it is generally more stable
  • Use L1 when you suspect many features are irrelevant
  • Try Elastic Net (combines L1 and L2) for the best of both worlds - see the sketch after this list
  • Always use cross-validation to tune the regularization parameter λ
  • Standardize features before applying regularization, since the penalty treats all weights on the same scale
  • Monitor both training and validation loss to detect overfitting
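Putting several of these tips together, here is a hedged end-to-end sketch: standardize the features, use an Elastic Net penalty, and pick the strength by cross-validation (scikit-learn exposes λ as alpha). The data, l1_ratio grid, and seed are illustrative.

    import numpy as np
    from sklearn.linear_model import ElasticNetCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 8)) * rng.uniform(0.1, 10.0, size=8)  # mixed scales
    y = 0.5 * X[:, 0] - 0.2 * X[:, 3] + rng.normal(scale=0.5, size=200)

    model = make_pipeline(
        StandardScaler(),                              # standardize first
        ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5),  # CV over alpha and the L1/L2 mix
    )
    model.fit(X, y)
    enet = model.named_steps["elasticnetcv"]
    print("chosen alpha:", round(enet.alpha_, 4), "l1_ratio:", enet.l1_ratio_)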