Learning Transferable Visual Models From Natural Language Supervision
Paper Summary
CLIP (Contrastive Language-Image Pre-training) learns visual concepts from natural language supervision, enabling zero-shot transfer to downstream tasks. By training on 400 million image-text pairs scraped from the web, CLIP matches or exceeds supervised baselines on many datasets without using any of those datasets' labeled training examples.
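To make the zero-shot transfer mechanism concrete, here is a minimal sketch using the openai/CLIP package (installable via `pip install git+https://github.com/openai/CLIP.git`): class names are turned into prompts, embedded by the text encoder, and the image is assigned to the nearest text embedding. The checkpoint name follows that repo's README; the image path and class labels below are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# A "classifier" built purely from class names via a prompt template --
# no task-specific labeled examples are used.
class_names = ["dog", "cat", "airplane"]  # placeholder labels
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each prompt embedding
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```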
Abstract
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million image-text pairs collected from the internet.
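The paper's Figure 3 gives NumPy-like pseudocode for this pre-training objective; below is a sketch of the same symmetric contrastive loss in PyTorch. Note that the paper learns the temperature as a parameter during training; a fixed value is used here for brevity, and the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched (image, text) pairs.

    image_features, text_features: [N, d] outputs of the two encoders.
    temperature is fixed here; the paper learns it during training.
    """
    # Project both embedding sets onto the unit sphere
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N cosine-similarity logits; matched pairs lie on the diagonal
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```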
Critical Analysis & Questions for Consideration
CLIP's vision-language pre-training opened new possibilities for zero-shot transfer, but critical examination reveals important limitations in its approach and evaluation.
Paradigm-Defining Achievement
CLIP showed that natural language supervision can stand in for manual labeling in vision tasks, enabling zero-shot transfer that lowered the barrier to building computer vision applications and spawned new research directions.
Data Collection Opacity
The paper provides minimal detail about how the 400M image-text pairs (the internal WebImageText dataset, WIT) were collected and filtered, and the dataset was never released. This opacity makes it effectively impossible to audit the data for biases or to reproduce the results.
Zero-Shot Claims Inflated
Many "zero-shot" evaluations use prompt engineering that effectively does few-shot learning. The distinction between true zero-shot and prompted performance is blurred throughout the paper.
Typographic Attacks
CLIP can be fooled simply by overlaying text on an image: an apple with a handwritten "iPod" label is classified as an iPod. This vulnerability to typographic attacks calls the depth of its visual understanding into question.
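A minimal sketch of how such an attack is constructed, assuming PIL and the zero-shot pipeline sketched earlier; the file names and text placement are placeholders.

```python
from PIL import Image, ImageDraw

# Overlay a misleading label on an image, then re-run zero-shot
# classification on the result (see the earlier sketch).
img = Image.open("apple.jpg").convert("RGB")  # placeholder input
draw = ImageDraw.Draw(img)
draw.text((10, 10), "iPod", fill="white")     # PIL's default bitmap font
img.save("apple_ipod.jpg")                    # classify this with CLIP
```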
Comparison Unfairness
Comparing CLIP's zero-shot performance to supervised models trained on limited data is not an apples-to-apples comparison: CLIP saw 400 million image-text pairs during pre-training. "Zero-shot" here means zero task-specific labels, not zero supervision.
Social Bias Underexplored
While the paper acknowledges bias concerns, it does not deeply investigate how web-scraped training data encodes and amplifies harmful stereotypes. Given CLIP's widespread deployment, this deserved more rigorous analysis.