Large-scale dataset of short videos with textual descriptions
WebVid-10M is a large-scale dataset of short videos with textual descriptions sourced from the web. The videos are diverse and rich in content. The dataset contains 10.7M video-caption pairs, totaling 52K hours of video.
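As a rough sanity check on what "short" means here, the two headline figures imply an average clip length. The sketch below derives it from the rounded numbers quoted above, so the result is only approximate; the function name `mean_clip_seconds` is a hypothetical helper, not part of any WebVid tooling.

```python
def mean_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Average clip duration in seconds, given total hours and clip count."""
    return total_hours * 3600 / num_clips


# Rounded figures quoted in the description above (assumed, not exact counts).
TOTAL_HOURS = 52_000
NUM_PAIRS = 10_700_000

avg = mean_clip_seconds(TOTAL_HOURS, NUM_PAIRS)
print(f"Implied average clip length: {avg:.1f} s")  # about 17.5 s
```

Under these rounded inputs, the average works out to roughly 17.5 seconds per video.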