Explore the most influential datasets that power modern machine learning research and applications. Each entry includes a format preview, use cases, and notable papers that cite the dataset.
All datasets listed are freely available for research purposes, though licenses vary.
| Dataset | Description | Provider | Domain | Size / Samples | Format | Use Cases |
|---|---|---|---|---|---|---|
| AudioSet (2017) | Large-scale audio event dataset with 632 classes. Covers everything from speech to music to environmental sounds. | Google Research | Audio Recognition | 2.1M clips; 10-second clips | YouTube IDs + timestamps | Sound classification, Audio tagging, +2 more |
| CIFAR-10/100 (2009) | 32x32 color images in 10/100 classes. Well suited to quickly testing new architectures and techniques. | Canadian Institute for Advanced Research | Computer Vision | 170 MB / 170 MB; 60K / 60K images | Binary (pickled) | Architecture design, Hyperparameter tuning, +2 more |
| Common Crawl (2011) | Massive web crawl data updated monthly. Filtered subsets have been used to train many modern language models, including GPT-3 and T5. | Common Crawl Foundation | Natural Language Processing | 300+ TB; Billions of web pages | WARC, WET, WAT files | Language model pre-training, Web mining, +2 more |
| GLUE Benchmark (2018) | Collection of 9 English tasks for evaluating NLU systems. Standard benchmark for language understanding. | NYU, University of Washington, DeepMind | Natural Language Processing | 1 GB; 100K+ examples | TSV/CSV files | Model evaluation, Transfer learning assessment, +2 more |
| Gymnasium (OpenAI Gym) (2016) | Maintained fork of OpenAI Gym. Suite of RL environments from classic control to Atari games. Standard interface for RL research. | Farama Foundation | Reinforcement Learning | N/A (Simulated); Unlimited episodes | Python API | RL algorithm development, Benchmarking, +2 more |
| HuggingFace Datasets (2020) | Massive collection of datasets with a unified API. Includes most major ML datasets and many unique ones. | HuggingFace | Various | Varies; 10,000+ datasets | Arrow/Parquet | Quick experimentation, Reproducible research, +2 more |
| ImageNet (2009) | Large-scale hierarchical image database with 21,841 categories. Revolutionized deep learning in computer vision. | Stanford Vision Lab | Computer Vision | 150 GB; 14M+ images | JPEG images + XML annotations | Image classification, Object detection, +2 more |
| Kinetics-700 (2017) | Human action videos covering 700 classes. Leading dataset for action recognition and video understanding. | DeepMind | Video Understanding | 650K clips; 10-second clips | YouTube URLs + timestamps | Action recognition, Video classification, +2 more |
| LAION-5B (2022) | Largest openly available image-text dataset. Powers open-source text-to-image models. | LAION | Multimodal | 240 TB; 5.85B image-text pairs | Image URLs + captions | Text-to-image training, CLIP training, +2 more |
| LibriSpeech (2015) | Read English speech from LibriVox audiobooks, split into "clean" and "other" subsets. De facto standard for English speech recognition research. | OpenSLR | Speech Recognition | 60 GB; 1000 hours | FLAC audio + transcripts | ASR training, Speaker recognition, +2 more |
| MNIST (1998) | Handwritten digits 0-9, the "Hello World" of machine learning. Simple but effective for learning basics. | Yann LeCun, NYU | Computer Vision | 11 MB; 70K images | IDX files (custom binary) | Classification tutorials, Neural network demos, +2 more |
| MS COCO (2014) | Complex everyday scenes with common objects in natural contexts. Gold standard for object detection and segmentation. | Microsoft | Computer Vision | 25 GB; 330K images | JPEG + JSON annotations | Object detection, Instance segmentation, +2 more |
| OpenWebText (2019) | Open-source recreation of GPT-2's WebText dataset. Web pages linked from Reddit submissions, filtered for quality. | Aaron Gokaslan, Vanya Cohen | Natural Language Processing | 40 GB; 8M documents | Plain text | Language model training, Text generation, +2 more |
| Pile (2020) | Diverse text corpus combining high-quality sources. Designed for training large language models. | EleutherAI | Natural Language Processing | 825 GB; 22 datasets combined | JSONL (zst compressed) | Large LM training, Multi-domain learning, +2 more |
| ShapeNet (2015) | Large-scale repository of 3D CAD models organized by WordNet. Essential for 3D deep learning research. | Princeton, Stanford, TTIC | 3D Vision | ~100 GB; 51K models | OBJ, MTL files | 3D reconstruction, Shape generation, +2 more |
| SQuAD 2.0 (2018) | Reading comprehension dataset with unanswerable questions. Tests both comprehension and knowing when not to answer. | Stanford NLP | Natural Language Processing | 35 MB; 150K questions | JSON | Question answering, Reading comprehension, +2 more |
| UCI ML Repository (1987) | Classic collection of datasets for traditional ML. Includes Iris, Wine, Adult Income, and many more. | UC Irvine | Various (Tabular) | Varies; 600+ datasets | CSV, ARFF, Data Folder | ML education, Algorithm testing, +2 more |
| WikiText-103 (2016) | Extracted from Wikipedia articles, maintaining document boundaries. Popular for language modeling benchmarks. | Salesforce Research | Natural Language Processing | 516 MB; 103M tokens | Plain text | Language modeling, Text generation, +2 more |
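The MNIST entry lists its format as IDX files (custom binary). As a rough illustration of what that means in practice, the sketch below parses an IDX buffer with only the Python standard library. The function name parse_idx and the synthetic two-image buffer are made up for this example. It assumes the commonly documented IDX layout: two zero bytes, a one-byte type code, a one-byte dimension count, big-endian 32-bit dimension sizes, then the raw values.

```python
import struct

# IDX type codes (third byte of the header) -> struct format character
IDX_DTYPES = {
    0x08: "B",  # unsigned byte (used by the MNIST image and label files)
    0x09: "b",  # signed byte
    0x0B: "h",  # 16-bit integer
    0x0C: "i",  # 32-bit integer
    0x0D: "f",  # 32-bit float
    0x0E: "d",  # 64-bit float
}


def parse_idx(buf: bytes):
    """Parse an IDX buffer and return (dims, flat_values)."""
    zero, dtype_code, ndim = struct.unpack_from(">HBB", buf, 0)
    if zero != 0:
        raise ValueError("not an IDX buffer: first two bytes must be zero")
    fmt_char = IDX_DTYPES[dtype_code]
    # Dimension sizes follow the 4-byte header, one big-endian uint32 each
    dims = struct.unpack_from(f">{ndim}I", buf, 4)
    count = 1
    for d in dims:
        count *= d
    values = struct.unpack_from(f">{count}{fmt_char}", buf, 4 + 4 * ndim)
    return dims, values


# Synthetic stand-in for an image file: 2 images of 2x3 unsigned-byte pixels
header = struct.pack(">HBB3I", 0, 0x08, 3, 2, 2, 3)
pixels = bytes(range(12))
dims, values = parse_idx(header + pixels)
print(dims)        # (2, 2, 3)
print(values[:4])  # (0, 1, 2, 3)
```

The MNIST files are usually distributed gzip-compressed, so a real file would first be read with gzip.open(path, "rb").read(). For full-size files, an array library such as NumPy is likely a better fit than a Python tuple of values.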
Getting Started
For beginners: Start with MNIST, CIFAR-10, or Iris from UCI ML Repository. These are small, well-documented, and perfect for learning.
For NLP: WikiText-103 and the GLUE benchmark are excellent starting points. Use HuggingFace Datasets for easy access.
For Computer Vision: After MNIST/CIFAR, move to MS COCO for detection or ImageNet for classification tasks.
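As a minimal sketch of the "easy access" route, the snippet below wraps load_dataset from the third-party datasets package (pip install datasets). The helper load_starter and the STARTER_DATASETS table are hypothetical names for this example. The Hub identifiers in the table are assumptions that may have been renamed, so check the Hub before relying on them. Nothing is downloaded until the helper is called.

```python
# Hypothetical lookup table: short key -> (Hub path, config name).
# The identifiers are assumptions; verify them on the Hugging Face Hub.
STARTER_DATASETS = {
    "mnist": ("mnist", None),
    "cifar10": ("cifar10", None),
    "glue-sst2": ("glue", "sst2"),
    "wikitext-103": ("wikitext", "wikitext-103-raw-v1"),
}


def load_starter(key: str, split: str = "train"):
    """Download (or reuse the local cache of) one split of a starter dataset."""
    # Imported lazily so this module loads even without the package installed
    from datasets import load_dataset

    path, config = STARTER_DATASETS[key]
    return load_dataset(path, config, split=split)
```

With network access, load_starter("glue-sst2", split="validation") should return a Dataset object whose rows can be indexed like a list of dictionaries. Larger datasets such as ImageNet typically require accepting license terms first, so they are left out of the table.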
Dataset Considerations
• Always check the license before using a dataset commercially
• Consider dataset biases and ethical implications
• Verify data quality and look for cleaned versions
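One simple data-quality check is to look for test examples that also appear in the training split. The sketch below is only an illustration under stated assumptions: the function names and the toy sentences are invented, and the normalization (lowercasing and collapsing whitespace) is one arbitrary choice among many. It catches exact duplicates after normalization, not paraphrases or near-duplicates.

```python
import hashlib


def normalized_hash(text: str) -> str:
    """Hash text after lowercasing and collapsing runs of whitespace."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_overlap(train, test):
    """Return the test examples whose normalized form also occurs in train."""
    train_hashes = {normalized_hash(t) for t in train}
    return [t for t in test if normalized_hash(t) in train_hashes]


# Toy data, invented for this example
train = ["The cat sat.", "Dogs bark loudly."]
test = ["the  cat sat.", "Birds sing."]
overlap = find_overlap(train, test)
print(overlap)  # ['the  cat sat.']
```

Hashing keeps memory use proportional to the number of training examples, not their total length, which helps on larger text corpora.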