ACAV100M

Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

We present an automated curation pipeline for audio-visual representation learning. We formulate an optimization problem whose goal is to find a subset of videos that maximizes the mutual information between their audio and visual channels. This yields a subset with high audio-visual correspondence, which is useful for self-supervised audio-visual representation learning. Using our approach, we created datasets of varying sizes from a large collection of unlabeled videos at an unprecedented scale: we process 140 million full-length videos (total duration 1,030 years) and produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the largest video dataset currently used in the audio-visual learning literature, i.e., AudioSet (8 months), and twice as large as the largest video dataset in the literature, i.e., HowTo100M (15 years).
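The core idea above — selecting clips so that audio and visual signals carry information about each other — can be sketched as a greedy subset selection that maximizes a mutual-information estimate computed from a contingency table of audio and visual cluster assignments. This is a minimal illustrative sketch, not the paper's implementation; the function names, the exhaustive greedy loop, and the assumption that clips are already clustered per modality are all simplifications for clarity.

```python
import numpy as np

def mutual_information(audio_labels, visual_labels, n_clusters):
    """Estimate MI between audio and visual cluster assignments
    via their joint contingency table (natural log, in nats)."""
    joint = np.zeros((n_clusters, n_clusters))
    for a, v in zip(audio_labels, visual_labels):
        joint[a, v] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal over audio clusters
    pv = joint.sum(axis=0, keepdims=True)   # marginal over visual clusters
    nz = joint > 0                          # avoid log(0) on empty cells
    return float((joint[nz] * np.log(joint[nz] / (pa @ pv)[nz])).sum())

def greedy_subset(audio_labels, visual_labels, k, n_clusters):
    """Greedily grow a subset of k clips, each step adding the clip
    that most increases the MI estimate of the selected subset."""
    selected, remaining = [], list(range(len(audio_labels)))
    for _ in range(k):
        best, best_mi = None, -np.inf
        for i in remaining:
            cand = selected + [i]
            mi = mutual_information(
                audio_labels[cand], visual_labels[cand], n_clusters)
            if mi > best_mi:
                best_mi, best = mi, i
        selected.append(best)
        remaining.remove(best)
    return selected
```

The exhaustive re-evaluation here is O(n·k) MI computations and is only workable for toy inputs; curating at the scale described requires the batched, sharded optimization the paper develops.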

Seoul National University
Task
Video Classification
Annotation Types
Classification Tags
140,000,000
Items
5
Classes
140,000,000
Labels
Last updated on October 31, 2023
Licensed under Research Only