Elastic Net Regularization
Elastic Net combines L1 and L2 regularization, offering a balance between feature selection (sparsity) and coefficient shrinkage. It addresses limitations of both Lasso and Ridge regression while maintaining their benefits.
Mathematical Formulation
Elastic Net Loss Function
Loss = Original Loss + λ[α·Σ|wᵢ| + (1-α)·Σwᵢ²]
Where: λ controls overall regularization strength, α ∈ [0,1] controls L1/L2 mix
- α = 1: pure L1 (Lasso)
- α = 0.5: equal mix of L1 and L2
- α = 0: pure L2 (Ridge)
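As a quick numerical check of the penalty formula above, the sketch below evaluates the combined penalty for an illustrative weight vector at each of these α settings; the weight values and λ are arbitrary choices for illustration, not taken from the text.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """Computes lam * (alpha * sum|w_i| + (1 - alpha) * sum w_i^2)."""
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return lam * (alpha * l1 + (1 - alpha) * l2)

w = np.array([0.5, -1.2, 0.0, 3.0])   # illustrative weight vector
lam = 0.1                             # illustrative regularization strength
for alpha in (1.0, 0.5, 0.0):         # pure L1, equal mix, pure L2
    print(f"alpha={alpha}: penalty={elastic_net_penalty(w, lam, alpha):.4f}")
```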
Comparative Visualization
[Interactive figure: the L1/L2 mix can be adjusted; the setting shown is 50% L1 + 50% L2.]
Regularization Path
[Figure: Weight Shrinkage as λ Increases. Shows how weights shrink differently under L1, L2, and Elastic Net as regularization increases.]
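A coefficient-path plot of this kind can be reproduced with scikit-learn's `enet_path` and `lasso_path`. The sketch below uses synthetic data from `make_regression`, so the dataset and the `l1_ratio` value are illustrative assumptions rather than anything specified above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import enet_path, lasso_path

# Synthetic data purely for illustration
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Coefficient paths over a grid of regularization strengths
# (scikit-learn calls the strength `alpha`, i.e. λ in the notation above)
alphas_enet, coefs_enet, _ = enet_path(X, y, l1_ratio=0.5)
alphas_lasso, coefs_lasso, _ = lasso_path(X, y)

# Each line is one coefficient shrinking as regularization grows
plt.plot(np.log10(alphas_lasso), coefs_lasso.T, linestyle="--")
plt.plot(np.log10(alphas_enet), coefs_enet.T)
plt.xlabel("log10(lambda)")
plt.ylabel("coefficient value")
plt.title("Lasso (dashed) vs Elastic Net coefficient paths")
plt.show()
```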
When to Use Elastic Net
| Scenario | Why Elastic Net | Typical α Value |
|---|---|---|
| Correlated features | L1 alone is unstable; Elastic Net groups correlated features | 0.5 - 0.7 |
| p >> n (many features) | Combines feature selection with stability | 0.7 - 0.9 |
| Feature groups | Tends to select/drop groups together | 0.3 - 0.7 |
| Unknown sparsity | Flexible between sparse and dense solutions | 0.5 |
| Genomic data | Handles correlation between genes | 0.5 - 0.9 |
Advantages Over Pure L1/L2
vs Pure L1 (Lasso)
- More stable with correlated features
- Can select more than n features when p > n
- Grouping effect for correlated variables (see the sketch below)
- Less sensitive to data perturbations
vs Pure L2 (Ridge)
- Can produce sparse models
- Better feature selection
- More interpretable results
- Handles irrelevant features better
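To make the grouping effect mentioned above concrete, the following sketch fits Lasso and Elastic Net on synthetic data with two nearly identical features; the data, `alpha`, and `l1_ratio` values are illustrative assumptions, and the behavior described in the comments is typical rather than guaranteed.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)   # nearly identical to x1 (highly correlated)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 3.0 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

# Lasso often keeps one of x1/x2 and zeroes the other;
# Elastic Net tends to spread similar weight across both.
print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)
```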
Comparison Table
| Aspect | L1 (Lasso) | L2 (Ridge) | Elastic Net |
|---|---|---|---|
| Feature Selection | Yes - aggressive | No | Yes - flexible |
| Correlated Features | Selects one, essentially arbitrarily | Keeps all | Groups them |
| Computation | Moderate | Fast | Slower |
| p > n Case | Selects at most n features | Handles well | No such limit |
| Interpretability | High | Low | High |
| Stability | Low | High | Medium-High |
Practical Implementation
Hyperparameter Tuning
- Use cross-validation to tune both λ and α
- A common approach is a grid search over α ∈ {0.1, 0.5, 0.7, 0.9}
- For each α, fit the λ path with coordinate descent and pick the best λ by cross-validation
- Use warm starts along the path for efficiency
Code Example (scikit-learn)
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data; replace with your own X, y
X, y = make_regression(n_samples=200, n_features=50, noise=1.0, random_state=42)

# Automatic CV over lambda (scikit-learn's `alpha`) with a fixed L1/L2 mix
# (`l1_ratio` corresponds to α in the notation above; a list of values may also be given)
model = ElasticNetCV(
    l1_ratio=0.5,
    cv=5,
    random_state=42
)
model.fit(X, y)

# Or use GridSearchCV to tune both parameters jointly
param_grid = {
    'alpha': [0.1, 0.5, 1.0],          # overall strength (λ above)
    'l1_ratio': [0.1, 0.5, 0.7, 0.9]   # L1/L2 mix (α above)
}
elastic_net = ElasticNet()
grid_search = GridSearchCV(elastic_net, param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_)
Best Practices
- Standardize features before applying Elastic Net
- Start with α = 0.5 as a baseline
- Use α closer to 1 when you need more sparsity
- Use α closer to 0 when features are highly correlated
- Monitor both training and validation performance
- Consider stability selection for robust feature selection
- Use warm starts when tuning hyperparameters (see the sketch after this list)
- Check coefficient paths to understand the regularization effect
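A minimal sketch combining two of the practices above (standardization and warm starts) in a sweep over λ; the data, λ grid, and `l1_ratio` here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data; standardize before regularizing
X, y = make_regression(n_samples=200, n_features=30, noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Reuse one estimator with warm_start=True so each fit starts from the
# previous solution, which speeds up a sweep over λ (scikit-learn's `alpha`)
model = ElasticNet(l1_ratio=0.5, warm_start=True, max_iter=10_000)
for lam in np.logspace(1, -3, 20):            # strong -> weak regularization
    model.set_params(alpha=lam)
    model.fit(X, y)
    n_selected = np.sum(model.coef_ != 0)
    print(f"lambda={lam:.4f}  non-zero coefficients: {n_selected}")
```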