ACAV100M

Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

We present an automated curation pipeline for audio-visual representation learning. We formulate an optimization problem whose goal is to find a subset of videos that maximizes the mutual information between their audio and visual channels. This yields a subset with high audio-visual correspondence, which is useful for self-supervised audio-visual representation learning.

Using our approach, we created datasets at varying scales from a large collection of unlabeled videos at an unprecedented scale: we process 140 million full-length videos (total duration 1,030 years) and produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the largest video dataset currently used in the audio-visual learning literature, AudioSet (8 months), and twice as large as the largest video dataset in the literature, HowTo100M (15 years).
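The subset-selection idea can be illustrated with a toy sketch (this is an illustrative simplification, not the paper's implementation): assuming each candidate clip already has an audio cluster label and a visual cluster label, we greedily add the clip that most increases the mutual information between the two label sequences of the chosen subset.

```python
import math
from collections import Counter

def mutual_info(pairs):
    """Mutual information (in nats) between the audio and visual
    cluster labels of a list of (audio_label, visual_label) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pv = Counter(v for _, v in pairs)
    mi = 0.0
    for (a, v), c in joint.items():
        p = c / n
        # p(a,v) * log( p(a,v) / (p(a) p(v)) )
        mi += p * math.log(p * n * n / (pa[a] * pv[v]))
    return mi

def greedy_subset(clips, k):
    """Greedily pick k clip indices maximizing the MI objective.
    `clips` is a hypothetical list of (audio_label, visual_label)
    per candidate clip; labels stand in for precomputed features."""
    chosen, remaining = [], list(range(len(clips)))
    for _ in range(k):
        best = max(
            remaining,
            key=lambda i: mutual_info([clips[j] for j in chosen] + [clips[i]]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

For example, among the clips `[(0, 0), (1, 1), (0, 1), (1, 0)]`, a subset of size two whose audio and visual labels agree, `{(0, 0), (1, 1)}`, has MI of log 2, while mismatched subsets score 0, so the greedy pass selects the corresponding clips. At the paper's scale this brute-force re-evaluation is infeasible; the actual pipeline relies on clustering and efficient subset-selection machinery.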

Seoul National University
Task: Video Classification
Annotation Types: Classification Tags
Items: 140,000,000
Classes: 5
Labels: 140,000,000
Last updated on January 20, 2022
Licensed under Research Only