Elastic Net Regularization

Elastic Net combines L1 and L2 regularization, offering a balance between feature selection (sparsity) and coefficient shrinkage. It addresses limitations of both Lasso and Ridge regression while maintaining their benefits.

Mathematical Formulation

Elastic Net Loss Function

Loss = Original Loss + λ[α·Σ|wᵢ| + (1-α)·Σwᵢ²]

where λ controls the overall regularization strength and α ∈ [0, 1] controls the L1/L2 mix:

  • α = 1: pure L1 (Lasso)
  • α = 0.5: equal mix of L1 and L2
  • α = 0: pure L2 (Ridge)
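
To make the formula concrete, here is a minimal sketch of the penalty term in NumPy (the weight vector and parameter values are hypothetical, chosen only for illustration):

import numpy as np

def elastic_net_penalty(w, lam, alpha):
    # lam (λ): overall regularization strength; alpha (α): L1/L2 mix
    l1 = np.sum(np.abs(w))   # Σ|wᵢ|
    l2 = np.sum(w ** 2)      # Σwᵢ²
    return lam * (alpha * l1 + (1 - alpha) * l2)

w = np.array([0.5, -1.2, 0.0, 2.0])  # hypothetical weights
print(elastic_net_penalty(w, lam=0.1, alpha=0.5))  # 0.1 * (0.5*3.7 + 0.5*5.69) = 0.4695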

Comparative Visualization

[Interactive visualization: a slider controls the L1/L2 penalty mix; current mix: 50% L1 + 50% L2]

Regularization Path

[Figure: weight shrinkage as λ increases, comparing how coefficients shrink under L1, L2, and Elastic Net as regularization strength grows]
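
Such paths can be computed directly. The following is a minimal sketch using scikit-learn's enet_path on synthetic data (the data and parameter values are assumptions for illustration):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import enet_path

# Synthetic data, for illustration only
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Coefficient paths over a grid of λ values (scikit-learn calls them alphas)
alphas, coefs, _ = enet_path(X, y, l1_ratio=0.5, n_alphas=50)

print(alphas.shape)  # (50,)     the λ grid
print(coefs.shape)   # (10, 50)  one shrinkage path per feature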

When to Use Elastic Net

Scenario                | Why Elastic Net                                               | Typical α
Correlated features     | L1 alone is unstable; Elastic Net groups correlated features  | 0.5 - 0.7
p >> n (many features)  | Combines feature selection with stability                     | 0.7 - 0.9
Feature groups          | Tends to select/drop groups together                          | 0.3 - 0.7
Unknown sparsity        | Flexible between sparse and dense solutions                   | 0.5
Genomic data            | Handles correlation between genes                             | 0.5 - 0.9

Advantages Over Pure L1/L2

vs Pure L1 (Lasso)

  • More stable with correlated features
  • Can select more than n features when p > n
  • Grouping effect for correlated variables (see the sketch below)
  • Less sensitive to data perturbations
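
A minimal sketch of the grouping effect on two nearly identical features (the synthetic data and penalty values are assumptions, chosen for illustration):

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# Two almost perfectly correlated features
X = np.hstack([x, x + 0.01 * rng.normal(size=(200, 1))])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)

# Lasso tends to concentrate weight on one of the pair;
# Elastic Net tends to spread it across both
print(Lasso(alpha=0.1).fit(X, y).coef_)
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)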

vs Pure L2 (Ridge)

  • Can produce sparse models
  • Better feature selection
  • More interpretable results
  • Handles irrelevant features better

Comparison Table

Aspect              | L1 (Lasso)           | L2 (Ridge)   | Elastic Net
Feature Selection   | Yes (aggressive)     | No           | Yes (flexible)
Correlated Features | Selects one randomly | Keeps all    | Groups them
Computation         | Moderate             | Fast         | Slower
p > n Case          | Selects at most n    | Handles well | No limit
Interpretability    | High                 | Low          | High
Stability           | Low                  | High         | Medium-High

Practical Implementation

Hyperparameter Tuning

  • Use cross-validation to tune both λ and α
  • A common approach: grid search over α ∈ {0.1, 0.5, 0.7, 0.9}
  • For each α, find the optimal λ using coordinate descent
  • Consider warm starts for efficiency (see the sketch after the code example)

Code Example (scikit-learn)

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=42)

# Automatic CV over the regularization strength (scikit-learn's alpha plays the
# role of λ), with the L1/L2 mix fixed (l1_ratio plays the role of α above)
model = ElasticNetCV(
    l1_ratio=0.5,
    cv=5,
    random_state=42
)
model.fit(X, y)

# Or use GridSearchCV to tune both parameters jointly
param_grid = {
    'alpha': [0.1, 0.5, 1.0],         # λ in the formula above
    'l1_ratio': [0.1, 0.5, 0.7, 0.9]  # α in the formula above
}

grid_search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_)
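
The tuning checklist above mentions warm starts. The sketch below (with an assumed, illustrative λ grid) reuses each previous solution as the starting point for the next fit:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=42)

# warm_start=True keeps the coefficients between calls to fit(),
# so each new (smaller) λ starts from the previous solution
model = ElasticNet(l1_ratio=0.5, warm_start=True, max_iter=10000)
for lam in [1.0, 0.5, 0.1, 0.01]:
    model.set_params(alpha=lam)
    model.fit(X, y)
    print(lam, (model.coef_ != 0).sum())  # nonzero coefficients at each λ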

Best Practices

  • Standardize features before applying Elastic Net (see the pipeline sketch after this list)
  • Start with α = 0.5 as a baseline
  • Use α closer to 1 when you need more sparsity
  • Use α closer to 0 when features are highly correlated
  • Monitor both training and validation performance
  • Consider stability selection for robust feature selection
  • Use warm starts when tuning hyperparameters
  • Check coefficient paths to understand the regularization effect
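
A minimal sketch of standardization done inside a cross-validated pipeline, so scaling statistics are learned only from the training folds (the data and parameter values are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=42)

# Scaling is refit on each training fold inside the CV loop, avoiding leakage
pipe = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=0.5, cv=5, random_state=42)
)
pipe.fit(X, y)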