LAION-400M

The world’s largest openly available image-text-pair dataset with 400 million samples

The LAION-400M dataset is fully open and freely accessible. All images and texts in the dataset have been filtered with OpenAI's CLIP: the cosine similarity between the text and image embeddings is calculated, and pairs with a similarity below 0.3 are dropped. The threshold of 0.3 was determined through human evaluation and appears to be a good heuristic for estimating whether the text semantically matches the image content. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.
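
The filtering rule described above can be sketched in a few lines. This is a minimal illustration, not the LAION pipeline itself: the embeddings are tiny hand-written vectors standing in for real CLIP outputs, and the names `cosine_similarity`, `filter_pairs`, and the `image_emb` / `text_emb` keys are hypothetical. Only the 0.3 threshold comes from the description.

```python
import math

# Threshold stated in the dataset description; pairs scoring below it are dropped.
SIMILARITY_THRESHOLD = 0.3


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=SIMILARITY_THRESHOLD):
    """Keep only pairs whose image/text similarity is at least `threshold`."""
    return [
        p for p in pairs
        if cosine_similarity(p["image_emb"], p["text_emb"]) >= threshold
    ]


# Toy 3-dimensional embeddings (assumed values; real CLIP embeddings are much larger).
pairs = [
    {"caption": "a dog on a beach",
     "image_emb": [1.0, 0.0, 0.0], "text_emb": [0.8, 0.6, 0.0]},
    {"caption": "click here",
     "image_emb": [1.0, 0.0, 0.0], "text_emb": [0.0, 1.0, 0.0]},
]

kept = filter_pairs(pairs)
print([p["caption"] for p in kept])
```

In this toy input the first pair has a similarity of about 0.8 and is kept, while the second pair's vectors are orthogonal (similarity 0) and it is dropped.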

TRULY OPEN AI
Task: Image Captioning
Annotation Types: Bounding Boxes
Items: 400,000,000
Classes: 7
Labels: 400,000,000
Last updated on: January 20, 2022
Licensed under: CC-BY