Data Augmentation
Data augmentation is a regularization technique that artificially increases the size of your training dataset by creating modified versions of existing data. This helps the model generalize better and reduces overfitting.
Common Augmentation Techniques
| Technique | Description | Best For | Parameters |
|---|---|---|---|
| Rotation | Rotate images by random angles | Natural images, digits | Max angle (degrees) |
| Translation | Shift images horizontally/vertically | Object detection, general | Max shift (pixels or %) |
| Scaling/Zoom | Zoom in/out randomly | Variable object sizes | Scale range |
| Flipping | Horizontal/vertical mirror | Natural images (not text) | Flip probability |
| Shearing | Slant transformation | Handwriting, perspective | Shear angle |
| Brightness/Contrast | Adjust image lighting | Real-world conditions | Adjustment range |
| Noise Addition | Add random noise | Robustness to noise | Noise level |
| Cutout/Erasing | Remove random patches | Occlusion handling | Patch size, count |
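Most of these transforms are available off the shelf. As a minimal sketch, the pipeline below wires several of them together with torchvision.transforms; the parameter values are illustrative, not tuned recommendations.

```python
# Illustrative augmentation pipeline; parameter values are examples only.
import torch
from torchvision import transforms

train_augmentations = transforms.Compose([
    transforms.RandomRotation(degrees=15),                  # rotation: max angle in degrees
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1),           # translation: max shift as a fraction of image size
                            scale=(0.9, 1.1),               # scaling/zoom range
                            shear=10),                      # shear angle in degrees
    transforms.RandomHorizontalFlip(p=0.5),                 # flipping with 50% probability
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast adjustment range
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.05 * torch.randn_like(t)).clamp(0, 1)),  # noise addition
    transforms.RandomErasing(p=0.25),                       # cutout/erasing of random patches
])
```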
Domain-Specific Augmentations
Computer Vision
- MixUp: Blend two images and labels
- CutMix: Mix rectangular patches
- AutoAugment: Learn optimal policies
- Color jittering: Hue, saturation changes
- Elastic deformations: For medical images
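As a concrete example, MixUp takes only a few lines. The sketch below assumes a PyTorch batch with soft (e.g. one-hot) labels; the function name and the alpha value are illustrative.

```python
# Minimal MixUp sketch (assumes labels are one-hot / soft vectors).
import torch

def mixup_batch(images, labels, alpha=0.2):
    """Blend each sample in the batch with a randomly chosen partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient in [0, 1]
    perm = torch.randperm(images.size(0))                         # random pairing within the batch
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels
```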
Natural Language Processing
- Synonym replacement
- Random insertion/deletion
- Back-translation
- Paraphrasing
- Contextual word embeddings
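Two of these ideas, synonym replacement and random deletion, are sketched below as a toy example; the tiny synonym table is only a stand-in for a real lexical resource such as WordNet.

```python
# Toy text-augmentation sketch; SYNONYMS is illustrative, not a real resource.
import random

SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def synonym_replacement(tokens, p=0.1):
    """Replace each token with a random synonym with probability p."""
    return [random.choice(SYNONYMS[t]) if t in SYNONYMS and random.random() < p else t
            for t in tokens]

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]
```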
When to Use Data Augmentation
| Scenario | Recommendation | Rationale |
|---|---|---|
| Small dataset | Highly Recommended | Increases effective dataset size |
| Class imbalance | Recommended | Augment minority classes more |
| High variance | Recommended | Reduces overfitting |
| Domain shift expected | Recommended | Improves robustness |
| Large dataset | Optional | May still improve generalization |
| Real-time training | Use Carefully | Can slow training significantly |
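For the class-imbalance case, one way to "augment minority classes more" is to oversample rare classes so that on-the-fly augmentation turns the repeats into varied copies rather than exact duplicates. The sketch below uses PyTorch's WeightedRandomSampler and assumes a dataset that applies augmentation in its `__getitem__`; the helper name is hypothetical.

```python
# Sketch: oversample rare classes; augmentation inside the dataset keeps the copies varied.
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels, batch_size=32):
    counts = np.bincount(labels)                    # samples per class
    weights = 1.0 / counts[labels]                  # rarer class -> larger sampling weight
    sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                    num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```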
Pros and Cons
Advantages
- Increases dataset size without collecting new data
- Reduces overfitting significantly
- Improves model robustness
- Can handle class imbalance
- Lets you encode domain-specific knowledge
- Often easy to implement
Disadvantages
- Increases training time
- Can introduce unrealistic samples
- Requires domain expertise
- May need careful validation
- Not all augmentations help
- Can mask the need to collect more or better data
Implementation Best Practices
- Apply augmentations only to training data, not validation/test
- Use online augmentation (on-the-fly) to save storage
- Start with simple augmentations, add complexity gradually
- Validate that augmentations preserve label correctness
- Consider augmentation probability (not all samples need augmentation)
- Monitor augmented samples visually during development
- Combine multiple augmentations for stronger effect
- Use libraries like Albumentations, imgaug, or torchvision.transforms
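As a minimal sketch of two of these points, train-only augmentation and augmentation probability, using torchvision (the transform choices and probabilities are illustrative):

```python
# Train-time augmentation vs. untouched validation/test preprocessing; values are illustrative.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomApply([transforms.RandomRotation(15)], p=0.5),  # only about half the samples get rotated
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

val_transform = transforms.Compose([
    transforms.ToTensor(),   # validation/test: preprocessing only, no augmentation
])
```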