Elastic Net Regularization
Elastic Net combines L1 and L2 regularization, offering a balance between feature selection (sparsity) and coefficient shrinkage. It addresses limitations of both Lasso and Ridge regression while maintaining their benefits.
Mathematical Formulation
Elastic Net Loss Function
Loss = Original Loss + λ[α·Σ|wᵢ| + (1-α)·Σwᵢ²]
Where: λ controls overall regularization strength, α ∈ [0,1] controls L1/L2 mix
- α = 1: pure L1 (Lasso)
- α = 0.5: equal mix of L1 and L2
- α = 0: pure L2 (Ridge)
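As a quick numerical check of the penalty formula above, the sketch below evaluates the combined penalty for an illustrative weight vector at each of these α settings; the weight values and λ are arbitrary choices for illustration, not taken from the text.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """Computes lam * (alpha * sum|w_i| + (1 - alpha) * sum w_i^2)."""
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return lam * (alpha * l1 + (1 - alpha) * l2)

w = np.array([0.5, -1.2, 0.0, 3.0])   # illustrative weight vector
lam = 0.1                             # illustrative regularization strength
for alpha in (1.0, 0.5, 0.0):         # pure L1, equal mix, pure L2
    print(f"alpha={alpha}: penalty={elastic_net_penalty(w, lam, alpha):.4f}")
```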
Comparative Visualization
[Interactive figure: the L1/L2 mix can be adjusted; the setting shown is 50% L1 + 50% L2.]
Regularization Path
[Figure: Weight Shrinkage as λ Increases. Shows how weights shrink differently under L1, L2, and Elastic Net as regularization increases.]
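A coefficient-path plot of this kind can be reproduced with scikit-learn's `enet_path` and `lasso_path`. The sketch below uses synthetic data from `make_regression`, so the dataset and the `l1_ratio` value are illustrative assumptions rather than anything specified above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import enet_path, lasso_path

# Synthetic data purely for illustration
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Coefficient paths over a grid of regularization strengths
# (scikit-learn calls the strength `alpha`, i.e. λ in the notation above)
alphas_enet, coefs_enet, _ = enet_path(X, y, l1_ratio=0.5)
alphas_lasso, coefs_lasso, _ = lasso_path(X, y)

# Each line is one coefficient shrinking as regularization grows
plt.plot(np.log10(alphas_lasso), coefs_lasso.T, linestyle="--")
plt.plot(np.log10(alphas_enet), coefs_enet.T)
plt.xlabel("log10(lambda)")
plt.ylabel("coefficient value")
plt.title("Lasso (dashed) vs Elastic Net coefficient paths")
plt.show()
```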
When to Use Elastic Net
| Scenario | Why Elastic Net | Typical α Value |
|---|---|---|
| Correlated features | L1 alone is unstable; Elastic Net groups correlated features | 0.5 - 0.7 |
| p >> n (many features) | Combines feature selection with stability | 0.7 - 0.9 |
| Feature groups | Tends to select/drop groups together | 0.3 - 0.7 |
| Unknown sparsity | Flexible between sparse and dense solutions | 0.5 |
| Genomic data | Handles correlation between genes | 0.5 - 0.9 |
Advantages Over Pure L1/L2
vs Pure L1 (Lasso)
- More stable with correlated features
- Can select more than n features when p > n
- Grouping effect for correlated variables (see the sketch below)
- Less sensitive to data perturbations
vs Pure L2 (Ridge)
- Can produce sparse models
- Better feature selection
- More interpretable results
- Handles irrelevant features better
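To make the grouping effect mentioned above concrete, the following sketch fits Lasso and Elastic Net on synthetic data with two nearly identical features; the data, `alpha`, and `l1_ratio` values are illustrative assumptions, and the behavior described in the comments is typical rather than guaranteed.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)   # nearly identical to x1 (highly correlated)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 3.0 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

# Lasso often keeps one of x1/x2 and zeroes the other;
# Elastic Net tends to spread similar weight across both.
print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)
```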
Comparison Table
| Aspect | L1 (Lasso) | L2 (Ridge) | Elastic Net |
|---|---|---|---|
| Feature Selection | Yes - aggressive | No | Yes - flexible |
| Correlated Features | Selects one, essentially arbitrarily | Keeps all | Groups them |
| Computation | Moderate | Fast | Slower |
| p > n Case | Selects at most n features | Handles well | No such limit |
| Interpretability | High | Low | High |
| Stability | Low | High | Medium-High |
Practical Implementation
Hyperparameter Tuning
- Use cross-validation to tune both λ and α
- A common approach is a grid search over α ∈ {0.1, 0.5, 0.7, 0.9}
- For each α, fit the λ path with coordinate descent and pick the best λ by cross-validation
- Use warm starts along the path for efficiency
Code Example (scikit-learn)
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data; replace with your own X, y
X, y = make_regression(n_samples=200, n_features=50, noise=1.0, random_state=42)

# Automatic CV over lambda (scikit-learn's `alpha`) with a fixed L1/L2 mix
# (`l1_ratio` corresponds to α in the notation above; a list of values may also be given)
model = ElasticNetCV(
    l1_ratio=0.5,
    cv=5,
    random_state=42
)
model.fit(X, y)

# Or use GridSearchCV to tune both parameters jointly
param_grid = {
    'alpha': [0.1, 0.5, 1.0],          # overall strength (λ above)
    'l1_ratio': [0.1, 0.5, 0.7, 0.9]   # L1/L2 mix (α above)
}
elastic_net = ElasticNet()
grid_search = GridSearchCV(elastic_net, param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_)
Best Practices
- Standardize features before applying Elastic Net
- Start with α = 0.5 as a baseline
- Use α closer to 1 when you need more sparsity
- Use α closer to 0 when features are highly correlated
- Monitor both training and validation performance
- Consider stability selection for robust feature selection
- Use warm starts when tuning hyperparameters (see the sketch after this list)
- Check coefficient paths to understand the regularization effect
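A minimal sketch combining two of the practices above (standardization and warm starts) in a sweep over λ; the data, λ grid, and `l1_ratio` here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data; standardize before regularizing
X, y = make_regression(n_samples=200, n_features=30, noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Reuse one estimator with warm_start=True so each fit starts from the
# previous solution, which speeds up a sweep over λ (scikit-learn's `alpha`)
model = ElasticNet(l1_ratio=0.5, warm_start=True, max_iter=10_000)
for lam in np.logspace(1, -3, 20):            # strong -> weak regularization
    model.set_params(alpha=lam)
    model.fit(X, y)
    n_selected = np.sum(model.coef_ != 0)
    print(f"lambda={lam:.4f}  non-zero coefficients: {n_selected}")
```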