L1/L2 Regularization
Regularization helps prevent overfitting by adding a penalty term to the loss function that discourages large weights. L1 regularization promotes sparsity (many weights become exactly zero), while L2 regularization promotes small but non-zero weights.
Mathematical Formulation
L1 Regularization (Lasso)
Loss = Original Loss + λ × Σ|wᵢ|
Gradient: ∂Loss/∂w = ∂(Original Loss)/∂w + λ × sign(w)
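A minimal NumPy sketch of this penalty and its (sub)gradient, assuming a weight vector `w`, a regularization strength `lam`, and a precomputed gradient of the original loss `grad_loss` (all illustrative names, not from the text above):

```python
import numpy as np

def l1_penalty(w, lam):
    """L1 penalty term: λ × Σ|wᵢ|."""
    return lam * np.sum(np.abs(w))

def l1_gradient(grad_loss, w, lam):
    """Subgradient of the penalized loss: ∂(Original Loss)/∂w + λ × sign(w).
    np.sign(0) is 0, a common subgradient choice at the non-differentiable point."""
    return grad_loss + lam * np.sign(w)
```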
L2 Regularization (Ridge)
Loss = Original Loss + λ × Σwᵢ²
Gradient: ∂Loss/∂w = ∂(Original Loss)/∂w + 2λw
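The L2 analogue, under the same assumptions (`w`, `lam`, and `grad_loss` are illustrative names):

```python
import numpy as np

def l2_penalty(w, lam):
    """L2 penalty term: λ × Σwᵢ²."""
    return lam * np.sum(w ** 2)

def l2_gradient(grad_loss, w, lam):
    """Gradient of the penalized loss: ∂(Original Loss)/∂w + 2λw."""
    return grad_loss + 2.0 * lam * w
```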
In both penalties, λ (lambda) controls the regularization strength: higher values mean stronger regularization.
When to Use Each Type
| Aspect | L1 Regularization | L2 Regularization |
|---|---|---|
| Feature selection | Yes, drives some weights to exactly zero | No, shrinks weights but keeps them non-zero |
| Sparsity | Produces sparse models | Keeps all features |
| Computational efficiency | Sparse solutions speed up inference, but the non-smooth penalty makes optimization harder | Smooth penalty with a simple gradient; ridge regression even has a closed-form solution |
| Interpretability | Better, fewer active features | Harder, all features contribute |
| Penalty on large weights | Grows linearly, so a few large weights are tolerated | Grows quadratically, so large weights are strongly discouraged |
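The feature-selection row can be checked empirically. A short sketch, assuming scikit-learn is available and using an arbitrary regularization strength (`alpha` is scikit-learn's name for λ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of them actually informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # usually many exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```

With data like this, Lasso typically zeroes out most of the uninformative coefficients, while Ridge leaves all of them non-zero but small.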
Pros and Cons
L1 Regularization
Pros:
- Automatic feature selection
- Creates interpretable models
- Good for high-dimensional data
- Reduces model complexity
Cons:
- Can be unstable with correlated features (tends to keep one of a correlated group and drop the rest)
- Non-differentiable at zero, so optimizers rely on subgradients or proximal updates (see the sketch after this list)
- May discard useful features
- Less smooth optimization
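One standard workaround for the non-differentiability, shown here only as a sketch (plain NumPy, illustrative names): proximal gradient descent (ISTA) takes an ordinary gradient step on the original loss and then applies soft-thresholding, which is what drives weights exactly to zero.

```python
import numpy as np

def soft_threshold(w, threshold):
    """Proximal operator of the L1 penalty: shrinks each weight toward zero
    and sets it exactly to zero once its magnitude falls below the threshold."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def proximal_gradient_step(w, grad_loss, lr, lam):
    """One ISTA step: gradient step on the original loss only,
    followed by soft-thresholding with threshold lr × λ."""
    return soft_threshold(w - lr * grad_loss, lr * lam)
```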
L2 Regularization
Pros:
- Smooth and differentiable everywhere
- Stable with correlated features
- Often gives better predictive accuracy when most features are informative
- Computationally efficient
Cons:
- No automatic feature selection; weights shrink but never reach exactly zero
- All features remain in the model
- Less interpretable when there are many features
- Irrelevant features still receive non-zero weights and can add noise
Practical Tips
- Start with L2 regularization as it's generally more stable
- Use L1 when you suspect many features are irrelevant
- Try Elastic Net (combines L1 and L2) for the best of both worlds
- Always use cross-validation to tune the regularization parameter λ (see the sketch after this list)
- Standardize features before applying regularization
- Monitor both training and validation loss to detect overfitting
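A minimal sketch combining the standardization and cross-validation tips, assuming scikit-learn and an arbitrary grid of λ values (exposed as `alpha`):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)

# Standardize first so the penalty treats every feature on the same scale.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10000)),
])

# Cross-validate over a grid of regularization strengths (λ, called alpha here).
search = GridSearchCV(pipeline, {"model__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("Best λ (alpha):", search.best_params_["model__alpha"])
```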