A large-scale audio-visual dataset
VGG-Sound is an audio-visual correspondence dataset consisting of short audio clips extracted from videos uploaded to YouTube. The clips span a large number of challenging acoustic environments and the noise characteristics of real applications. All videos are captured "in the wild" and exhibit audio-visual correspondence, in the sense that the sound source is visually evident. The dataset includes both the audio and the video of each segment, and each segment is 10 seconds long.
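
As a rough illustration of the 10-second segment length, the sketch below models one segment and builds an ffmpeg command that would cut it from a locally available source video. The `Clip` fields (`video_id`, `start_seconds`), the file paths, and the helper name `ffmpeg_args` are hypothetical and not part of the dataset's published format; only the 10-second duration comes from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Segment length stated in the dataset description.
CLIP_SECONDS = 10


@dataclass(frozen=True)
class Clip:
    """One segment. Both fields are hypothetical, for illustration only."""
    video_id: str        # assumed: identifier of the source video
    start_seconds: int   # assumed: offset of the segment in the source video

    @property
    def end_seconds(self) -> int:
        # A segment always covers exactly CLIP_SECONDS of the source.
        return self.start_seconds + CLIP_SECONDS


def ffmpeg_args(clip: Clip, source_path: str, out_path: str) -> list[str]:
    """Build (but do not run) an ffmpeg command that cuts the segment.

    -ss seeks to the start offset and -t limits the duration; both are
    given before -i so they apply to the input file.
    """
    return [
        "ffmpeg",
        "-ss", str(clip.start_seconds),
        "-t", str(CLIP_SECONDS),
        "-i", source_path,
        out_path,
    ]
```

The command is returned as an argument list rather than executed, so it can be inspected or passed to `subprocess.run` once the source video is available locally.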