Data Augmentation
Data augmentation is a regularization technique that artificially increases the size of your training dataset by creating modified versions of existing data. This helps the model generalize better and reduces overfitting.
Common Augmentation Techniques
| Technique | Description | Best For | Parameters |
|---|---|---|---|
| Rotation | Rotate images by random angles | Natural images, digits | Max angle (degrees) |
| Translation | Shift images horizontally/vertically | Object detection, general | Max shift (pixels or %) |
| Scaling/Zoom | Zoom in/out randomly | Variable object sizes | Scale range |
| Flipping | Horizontal/vertical mirror | Natural images (not text) | Flip probability |
| Shearing | Slant transformation | Handwriting, perspective | Shear angle |
| Brightness/Contrast | Adjust image lighting | Real-world conditions | Adjustment range |
| Noise Addition | Add random noise | Robustness to noise | Noise level |
| Cutout/Erasing | Remove random patches | Occlusion handling | Patch size, count |
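Most of these transforms are available off the shelf. As a minimal sketch, the pipeline below wires several of them together with torchvision.transforms; the parameter values are illustrative, not tuned recommendations.

```python
# Illustrative augmentation pipeline; parameter values are examples only.
import torch
from torchvision import transforms

train_augmentations = transforms.Compose([
    transforms.RandomRotation(degrees=15),                  # rotation: max angle in degrees
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1),           # translation: max shift as a fraction of image size
                            scale=(0.9, 1.1),               # scaling/zoom range
                            shear=10),                      # shear angle in degrees
    transforms.RandomHorizontalFlip(p=0.5),                 # flipping with 50% probability
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast adjustment range
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.05 * torch.randn_like(t)).clamp(0, 1)),  # noise addition
    transforms.RandomErasing(p=0.25),                       # cutout/erasing of random patches
])
```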
Domain-Specific Augmentations
Computer Vision
- MixUp: Blend two images and labels
- CutMix: Mix rectangular patches
- AutoAugment: Learn optimal policies
- Color jittering: Hue, saturation changes
- Elastic deformations: For medical images
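As a concrete example, MixUp takes only a few lines. The sketch below assumes a PyTorch batch with soft (e.g. one-hot) labels; the function name and the alpha value are illustrative.

```python
# Minimal MixUp sketch (assumes labels are one-hot / soft vectors).
import torch

def mixup_batch(images, labels, alpha=0.2):
    """Blend each sample in the batch with a randomly chosen partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient in [0, 1]
    perm = torch.randperm(images.size(0))                         # random pairing within the batch
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels
```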
Natural Language Processing
- Synonym replacement
- Random insertion/deletion
- Back-translation
- Paraphrasing
- Contextual word embeddings
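Two of these ideas, synonym replacement and random deletion, are sketched below as a toy example; the tiny synonym table is only a stand-in for a real lexical resource such as WordNet.

```python
# Toy text-augmentation sketch; SYNONYMS is illustrative, not a real resource.
import random

SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def synonym_replacement(tokens, p=0.1):
    """Replace each token with a random synonym with probability p."""
    return [random.choice(SYNONYMS[t]) if t in SYNONYMS and random.random() < p else t
            for t in tokens]

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]
```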
When to Use Data Augmentation
| Scenario | Recommendation | Rationale |
|---|---|---|
| Small dataset | Highly Recommended | Increases effective dataset size |
| Class imbalance | Recommended | Augment minority classes more |
| High variance | Recommended | Reduces overfitting |
| Domain shift expected | Recommended | Improves robustness |
| Large dataset | Optional | May still improve generalization |
| Real-time training | Use Carefully | Can slow training significantly |
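For the class-imbalance case, one way to "augment minority classes more" is to oversample rare classes so that on-the-fly augmentation turns the repeats into varied copies rather than exact duplicates. The sketch below uses PyTorch's WeightedRandomSampler and assumes a dataset that applies augmentation in its `__getitem__`; the helper name is hypothetical.

```python
# Sketch: oversample rare classes; augmentation inside the dataset keeps the copies varied.
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels, batch_size=32):
    counts = np.bincount(labels)                    # samples per class
    weights = 1.0 / counts[labels]                  # rarer class -> larger sampling weight
    sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                    num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```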
Pros and Cons
Advantages
- Increases dataset size without collecting new data
- Reduces overfitting significantly
- Improves model robustness
- Can handle class imbalance
- Lets you encode domain-specific knowledge
- Often easy to implement
Disadvantages
- Increases training time
- Can introduce unrealistic samples
- Requires domain expertise
- May need careful validation
- Not all augmentations help
- Can mask the need to collect more or better data
Implementation Best Practices
- Apply augmentations only to training data, not validation/test
- Use online augmentation (on-the-fly) to save storage
- Start with simple augmentations, add complexity gradually
- Validate that augmentations preserve label correctness
- Consider augmentation probability (not all samples need augmentation)
- Monitor augmented samples visually during development
- Combine multiple augmentations for stronger effect
- Use libraries like Albumentations, imgaug, or torchvision.transforms
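As a minimal sketch of two of these points, train-only augmentation and augmentation probability, using torchvision (the transform choices and probabilities are illustrative):

```python
# Train-time augmentation vs. untouched validation/test preprocessing; values are illustrative.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomApply([transforms.RandomRotation(15)], p=0.5),  # only about half the samples get rotated
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

val_transform = transforms.Compose([
    transforms.ToTensor(),   # validation/test: preprocessing only, no augmentation
])
```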