Data Augmentation

Data augmentation is a regularization technique that increases the effective size of your training dataset by creating modified, label-preserving versions of existing samples. This helps the model generalize better and reduces overfitting.

Common Augmentation Techniques

Technique            | Description                          | Best For                   | Parameters
Rotation             | Rotate images by random angles       | Natural images, digits     | Max angle (degrees)
Translation          | Shift images horizontally/vertically | Object detection, general  | Max shift (pixels or %)
Scaling/Zoom         | Zoom in/out randomly                 | Variable object sizes      | Scale range
Flipping             | Horizontal/vertical mirror           | Natural images (not text)  | Flip probability
Shearing             | Slant transformation                 | Handwriting, perspective   | Shear angle
Brightness/Contrast  | Adjust image lighting                | Real-world conditions      | Adjustment range
Noise Addition       | Add random noise                     | Robustness to noise        | Noise level
Cutout/Erasing       | Remove random patches                | Occlusion handling         | Patch size, count
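
Several of the techniques in the table can be sketched in a few lines of NumPy. The example below is a minimal illustration, not a reference implementation: it assumes a grayscale image stored as a float array in [0, 1], and the function names and default parameters are invented for this sketch. Rotation, scaling, and shearing need interpolation and are usually left to a library.

```python
import numpy as np

def random_flip(img, rng, p=0.5):
    """Mirror the image left-right with probability p."""
    return img[:, ::-1] if rng.random() < p else img

def random_translate(img, rng, max_shift=2):
    """Shift by up to max_shift pixels per axis; exposed pixels are filled with 0."""
    h, w = img.shape[:2]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(img)
    # Copy the region of the source that is still visible after the shift.
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def random_brightness(img, rng, max_delta=0.2):
    """Add one random offset in [-max_delta, max_delta] to every pixel."""
    return np.clip(img + rng.uniform(-max_delta, max_delta), 0.0, 1.0)

def add_gaussian_noise(img, rng, sigma=0.05):
    """Add independent Gaussian noise to each pixel."""
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

def cutout(img, rng, size=4):
    """Zero out one size x size patch at a random location."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = img.copy()
    out[top:top + size, left:left + size] = 0.0
    return out

rng = np.random.default_rng(0)
image = rng.random((16, 16))  # stand-in for a real grayscale image
augmented = image
for transform in (random_flip, random_translate, random_brightness,
                  add_gaussian_noise, cutout):
    augmented = transform(augmented, rng)
```

Each function takes the random generator explicitly, which keeps the pipeline reproducible when a fixed seed is used.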

Domain-Specific Augmentations

Computer Vision

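  • MixUp: Blend two images and their labels with a random weight
  • CutMix: Paste a rectangular patch from one image into another and mix the labels by patch area
  • AutoAugment: Learn augmentation policies from data with a search procedure
  • Color jittering: Random hue and saturation changes
  • Elastic deformations: Smooth local warps, common for medical images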
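
To make MixUp and CutMix concrete, here is a hedged NumPy sketch. It assumes one-hot label vectors and images of identical shape; the function names and the default `alpha` values are choices made for this illustration, not settings taken from this text.

```python
import numpy as np

def mixup(x1, y1, x2, y2, rng, alpha=0.2):
    """Blend two samples and their one-hot labels with one shared weight."""
    lam = rng.beta(alpha, alpha)          # mixing weight in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam

def cutmix(x1, y1, x2, y2, rng, alpha=1.0):
    """Paste a random rectangle of x2 into x1; mix labels by pasted area."""
    h, w = x1.shape[:2]
    lam = rng.beta(alpha, alpha)
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)   # patch centre
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x = x1.copy()
    x[top:bottom, left:right] = x2[top:bottom, left:right]
    # Use the area actually pasted (after clipping) as the label weight.
    area = (bottom - top) * (right - left) / (h * w)
    y = (1.0 - area) * y1 + area * y2
    return x, y

rng = np.random.default_rng(0)
img_a, img_b = np.zeros((8, 8)), np.ones((8, 8))
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_x, mixed_y, lam = mixup(img_a, label_a, img_b, label_b, rng)
patched_x, patched_y = cutmix(img_a, label_a, img_b, label_b, rng)
```

In both cases the label becomes a soft distribution over the two classes, so the loss function has to accept non-integer targets.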

Natural Language Processing

  • Synonym replacement
  • Random insertion/deletion
  • Back-translation
  • Paraphrasing
  • Contextual word embeddings: Replace words with substitutes a language model predicts from context
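
The simpler token-level operations can be sketched with the standard library alone. This is only an illustration: the `SYNONYMS` table is a toy dictionary invented for the example (a real system would draw on a thesaurus or a language model), and the function names and probabilities are assumptions.

```python
import random

# Toy synonym table, invented for illustration only.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}

def synonym_replace(tokens, rng, p=0.3):
    """Swap each word that has a known synonym with probability p."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
            for t in tokens]

def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, but never return an empty sentence."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]

def random_insert(tokens, rng, n=1):
    """Insert a synonym of a random replaceable word at a random position, n times."""
    out = list(tokens)
    candidates = [t for t in tokens if t in SYNONYMS]
    for _ in range(n if candidates else 0):
        word = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), word)
    return out

rng = random.Random(0)
tokens = "the quick dog is happy".split()
variants = [synonym_replace(tokens, rng),
            random_delete(tokens, rng),
            random_insert(tokens, rng)]
```

Text augmentations change meaning more easily than image augmentations do (deleting "not" flips a sentiment label), so generated samples are worth reading before they go into training.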

When to Use Data Augmentation

Scenario               | Recommendation      | Rationale
Small dataset          | Highly Recommended  | Increases effective dataset size
Class imbalance        | Recommended         | Augment minority classes more
High variance          | Recommended         | Reduces overfitting
Domain shift expected  | Recommended         | Improves robustness
Large dataset          | Optional            | May still improve generalization
Real-time training     | Use Carefully       | Can slow training significantly
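
For the class-imbalance row, one simple way to "augment minority classes more" is to decide how many augmented copies each original sample should get so that every class roughly matches the largest one. The helper below is a hypothetical sketch of that arithmetic, with made-up class names and counts; it is not the only balancing strategy.

```python
import math
from collections import Counter

def augmentation_factors(labels):
    """Per class: number of augmented copies to create for each original sample."""
    counts = Counter(labels)
    target = max(counts.values())          # size of the largest class
    return {cls: math.ceil(target / n) - 1 for cls, n in counts.items()}

# Made-up label distribution for illustration.
labels = ["cat"] * 100 + ["dog"] * 25 + ["fox"] * 10
factors = augmentation_factors(labels)
print(factors)  # {'cat': 0, 'dog': 3, 'fox': 9}
```

With these factors each dog image yields 3 augmented copies and each fox image 9, bringing both classes to 100 samples. Heavy reuse of a handful of originals can still overfit to them, so the validation set should stay unaugmented and unbalanced.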

Pros and Cons

Advantages

  • Increases effective dataset size without collecting new data
  • Often reduces overfitting, especially on small datasets
  • Improves model robustness
  • Can help with class imbalance
  • Lets you encode domain knowledge, such as known invariances
  • Often easy to implement

Disadvantages

  • Increases training time
  • Can introduce unrealistic samples
  • May require domain expertise to choose label-preserving transforms
  • May need careful validation
  • Not all augmentations help
  • Can mask the need to collect more or better data

Implementation Best Practices

  • Apply augmentations only to training data, not validation/test
  • Use online augmentation (on-the-fly) to save storage
  • Start with simple augmentations, add complexity gradually
  • Validate that augmentations preserve label correctness
  • Consider augmentation probability (not all samples need augmentation)
  • Monitor augmented samples visually during development
  • Combine multiple augmentations for stronger effect
  • Use libraries like Albumentations, imgaug, or torchvision.transforms
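
Three of these practices (training-only augmentation, on-the-fly application, and per-transform probability) can be shown together in a small standard-library sketch. The `RandomApply` and `AugmentedDataset` classes and the `jitter` and `reverse` transforms are hypothetical names written for this example; the libraries listed above provide their own equivalents with different interfaces.

```python
import random

class RandomApply:
    """Apply `transform` with probability p, otherwise pass the sample through."""
    def __init__(self, transform, p=0.5):
        self.transform, self.p = transform, p

    def __call__(self, sample, rng):
        return self.transform(sample, rng) if rng.random() < self.p else sample

class AugmentedDataset:
    """Wraps (sample, label) pairs; augments lazily and only when train=True."""
    def __init__(self, data, transforms, train, seed=0):
        self.data, self.transforms, self.train = data, transforms, train
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        sample, label = self.data[i]
        if self.train:  # validation/test data is returned untouched
            for t in self.transforms:
                sample = t(sample, self.rng)
        return sample, label  # the label is never modified

def jitter(values, rng, scale=0.1):
    """Add small uniform noise to each value (returns a new list)."""
    return [v + rng.uniform(-scale, scale) for v in values]

def reverse(values, rng):
    """Reverse the sequence, a stand-in for a flip."""
    return values[::-1]

data = [([1.0, 2.0, 3.0], 0), ([4.0, 5.0, 6.0], 1)]
transforms = [RandomApply(jitter, p=1.0), RandomApply(reverse, p=0.5)]
train_ds = AugmentedDataset(data, transforms, train=True)
val_ds = AugmentedDataset(data, transforms, train=False)
```

Because augmentation happens inside `__getitem__`, nothing extra is stored on disk and each epoch sees a different variant of every training sample, while `val_ds` always returns the original data.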