Open Source Datasets

Explore the most influential datasets that power modern machine learning research and applications. Each dataset entry includes a format preview, use cases, and notable papers that cite it.

All datasets listed are freely available for research purposes, though licenses vary.

Dataset	Provider	Domain	Size / Samples	Format	Use Cases
AudioSet (2017): Large-scale audio event dataset with 632 classes. Covers everything from speech to music to environmental sounds.	Google Research	Audio Recognition	2.1M clips, 10 seconds each	YouTube IDs + timestamps	Sound classification, audio tagging
CIFAR-10/100 (2009): 32x32 color images in 10/100 classes. Well suited to quickly testing new architectures and techniques.	Canadian Institute for Advanced Research	Computer Vision	170 MB / 170 MB, 60K / 60K images	Binary (pickled)	Architecture design, hyperparameter tuning
Common Crawl (2011): Massive web crawl corpus, updated roughly monthly. Filtered subsets are widely used to pre-train large language models, including GPT-3.	Common Crawl Foundation	Natural Language Processing	300+ TB, billions of web pages	WARC, WET, WAT files	Language model pre-training, web mining
GLUE Benchmark (2018): Collection of 9 English tasks for evaluating natural language understanding (NLU) systems. Standard benchmark for language understanding.	NYU, University of Washington, DeepMind	Natural Language Processing	1 GB, 100K+ examples	TSV/CSV files	Model evaluation, transfer learning assessment
Gymnasium (OpenAI Gym) (2016): Maintained fork of OpenAI Gym. Suite of RL environments from classic control to Atari games. Standard interface for RL research.	Farama Foundation	Reinforcement Learning	N/A (simulated), unlimited episodes	Python API	RL algorithm development, benchmarking
Hugging Face Datasets (2020): Massive collection of datasets behind a unified API. Includes most major ML datasets and many unique ones.	Hugging Face	Various	Varies, 10,000+ datasets	Arrow/Parquet	Quick experimentation, reproducible research
ImageNet (2009): Large-scale hierarchical image database with 21,841 categories; it revolutionized deep learning in computer vision.	Stanford Vision Lab	Computer Vision	150 GB, 14M+ images	JPEG images + XML annotations	Image classification, object detection
Kinetics-700 (2019; original Kinetics release 2017): Human action videos covering 700 classes. Leading dataset for action recognition and video understanding.	DeepMind	Video Understanding	650K clips, 10 seconds each	YouTube URLs + timestamps	Action recognition, video classification
LAION-5B (2022): At release, the largest openly available image-text dataset. Used to train open-source text-to-image models.	LAION	Multimodal	240 TB, 5.85B image-text pairs	Image URLs + captions	Text-to-image training, CLIP training
LibriSpeech (2015): Read English speech from public-domain audiobooks, split into "clean" and "other" subsets. De facto standard for English speech recognition research.	OpenSLR	Speech Recognition	60 GB, 1000 hours	FLAC audio + transcripts	ASR training, speaker recognition
MNIST (1998): Handwritten digits 0-9, the "Hello World" of machine learning. Simple but effective for learning the basics.	Yann LeCun, NYU	Computer Vision	11 MB, 70K images	IDX files (custom binary)	Classification tutorials, neural network demos
MS COCO (2014): Complex everyday scenes with common objects in natural contexts. Gold standard for object detection and segmentation.	Microsoft	Computer Vision	25 GB, 330K images	JPEG + JSON annotations	Object detection, instance segmentation
OpenWebText (2019): Open-source recreation of GPT-2's WebText dataset, built from web pages linked in Reddit submissions.	Aaron Gokaslan, Vanya Cohen	Natural Language Processing	40 GB, 8M documents	Plain text	Language model training, text generation
The Pile (2020): Diverse text corpus combining high-quality sources. Designed for training large language models.	EleutherAI	Natural Language Processing	825 GB, 22 datasets combined	JSONL (zst compressed)	Large LM training, multi-domain learning
ShapeNet (2015): Large-scale repository of 3D CAD models organized according to the WordNet taxonomy. Essential for 3D deep learning research.	Princeton, Stanford, TTIC	3D Vision	~100 GB, 51K models	OBJ, MTL files	3D reconstruction, shape generation
SQuAD 2.0 (2018): Reading comprehension dataset with unanswerable questions. Tests both comprehension and knowing when not to answer.	Stanford NLP	Natural Language Processing	35 MB, 150K questions	JSON	Question answering, reading comprehension
UCI ML Repository (1987): Classic collection of datasets for traditional ML. Includes Iris, Wine, Adult Income, and many more.	UC Irvine	Various (Tabular)	Varies, 600+ datasets	CSV, ARFF, Data Folder	ML education, algorithm testing
WikiText-103 (2016): Extracted from Wikipedia articles, maintaining document boundaries. Popular for language modeling benchmarks.	Salesforce Research	Natural Language Processing	516 MB, 103M tokens	Plain text	Language modeling, text generation
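
Many of the formats in the table can be inspected with nothing more than the standard library. As one illustration, the sketch below walks a SQuAD 2.0-style JSON document and separates answerable from unanswerable questions. The sample record and the helper name `iter_questions` are made up for this example; only the nesting (`data`, `paragraphs`, `qas`, `is_impossible`, `answers`) follows the published SQuAD 2.0 layout.

```python
import json

# Tiny made-up document in the SQuAD 2.0 layout (not real dataset content).
raw = json.dumps({
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "MNIST contains 70,000 images of handwritten digits.",
            "qas": [
                {"id": "q1",
                 "question": "How many images does MNIST contain?",
                 "is_impossible": False,
                 "answers": [{"text": "70,000", "answer_start": 15}]},
                {"id": "q2",
                 "question": "Who funded MNIST?",
                 "is_impossible": True,
                 "answers": []},
            ],
        }],
    }],
})


def iter_questions(squad: dict):
    """Yield (question, first answer text or None) for every question."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable questions carry is_impossible=True and no answers.
                if qa.get("is_impossible", False) or not qa["answers"]:
                    yield qa["question"], None
                else:
                    yield qa["question"], qa["answers"][0]["text"]


pairs = list(iter_questions(json.loads(raw)))
for question, answer in pairs:
    print(question, "->", answer)
```

For a real file, replace `json.loads(raw)` with `json.load(open(path, encoding="utf-8"))`, where `path` points at the downloaded JSON.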

Getting Started

For beginners: Start with MNIST, CIFAR-10, or Iris from UCI ML Repository. These are small, well-documented, and perfect for learning.
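
MNIST ships in the IDX binary format rather than as image files, which can be a first hurdle. The sketch below parses the IDX header (two zero bytes, a type code, the number of dimensions, then one big-endian 32-bit size per dimension) and reads the values that follow. The function name `parse_idx` and the synthetic sample are assumptions for illustration; real MNIST files are usually gzip-compressed, so they would be read with `gzip.open(path, "rb").read()` first.

```python
import struct

# IDX type codes mapped to struct format characters.
_IDX_DTYPES = {0x08: "B", 0x09: "b", 0x0B: "h", 0x0C: "i", 0x0D: "f", 0x0E: "d"}


def parse_idx(raw: bytes):
    """Parse an IDX byte string into (dims, flat list of values)."""
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code not in _IDX_DTYPES:
        raise ValueError("not a valid IDX header")
    # One big-endian unsigned 32-bit integer per dimension.
    dims = struct.unpack(f">{ndim}I", raw[4:4 + 4 * ndim])
    count = 1
    for d in dims:
        count *= d
    fmt = f">{count}{_IDX_DTYPES[dtype_code]}"
    values = list(struct.unpack_from(fmt, raw, 4 + 4 * ndim))
    return dims, values


# Synthetic stand-in for an image file: 2 "images" of 2x3 unsigned bytes.
sample = struct.pack(">HBB3I", 0, 0x08, 3, 2, 2, 3) + bytes(range(12))
dims, values = parse_idx(sample)
print(dims, values[:6])
```

The flat list is kept deliberately simple; in practice the values would typically be reshaped with NumPy using `dims`.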

For NLP: WikiText-103 and the GLUE benchmark are excellent starting points. Use Hugging Face Datasets for easy access.
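
If you work with GLUE's TSV files directly instead of going through a dataset library, a common stumbling block is that sentences may contain unbalanced double quotes, which a default CSV reader treats as quoting characters. A minimal sketch of a tolerant reader follows; the helper name `read_tsv` and the two sample rows are invented for illustration, and the `sentence`/`label` header is an assumption about the file you are reading.

```python
import csv
import io

# Made-up rows shaped like a two-column task file (header + tab-separated fields).
sample_tsv = 'sentence\tlabel\nshe said "fine" and left\t1\na dull , lifeless film\t0\n'


def read_tsv(handle):
    """Read a tab-separated file into a list of dicts keyed by the header row."""
    # QUOTE_NONE tells the reader to treat double quotes as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_tsv(io.StringIO(sample_tsv))
labels = [int(row["label"]) for row in rows]
print(rows[0]["sentence"], labels)
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.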

For Computer Vision: After MNIST/CIFAR, move to MS COCO for detection or ImageNet for classification tasks.

Dataset Considerations

• Always check the license before using a dataset commercially
• Consider dataset biases and ethical implications
• Verify data quality and look for cleaned versions
• Use appropriate train/validation/test splits
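
When a dataset does not come with official splits, one simple way to follow the last point is a seeded shuffle over example indices. This is a sketch under that assumption, not a prescription: the function name `split_indices` and the 80/10/10 default are illustrative, and benchmarks that publish their own splits (GLUE and SQuAD, for example) should be used with those splits so results stay comparable.

```python
import random


def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle range(n) with a fixed seed and cut it into train/val/test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    indices = list(range(n))
    # A private Random instance keeps the split reproducible and leaves
    # the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

A plain random split like this ignores class balance and grouping (for instance, several clips from the same video), so stratified or group-aware splitting may be the better choice for some datasets.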