Dataset for audio-visual video understanding research
We introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised audio-visual event localization, weakly-supervised audio-visual event localization, and cross-modality localization. The AVE dataset contains 4143 videos covering 28 event categories, and each video is temporally labeled with audio-visual event boundaries.
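
To make the definition above concrete, the following is a minimal sketch of how per-segment audibility and visibility flags could be combined into audio-visual event boundaries. It is an illustration only, not the dataset's annotation tooling: the function names `audio_visual_segments` and `event_boundaries`, the boolean per-segment representation, and the `segment_seconds` default of 1.0 are all assumptions made for this example.

```python
from typing import List, Tuple


def audio_visual_segments(audible: List[bool], visible: List[bool]) -> List[bool]:
    """Mark a segment as an audio-visual event only if it is both audible and visible."""
    if len(audible) != len(visible):
        raise ValueError("audible and visible must cover the same number of segments")
    return [a and v for a, v in zip(audible, visible)]


def event_boundaries(
    flags: List[bool], segment_seconds: float = 1.0  # assumed segment length
) -> List[Tuple[float, float]]:
    """Convert per-segment flags into (start, end) boundaries in seconds."""
    boundaries: List[Tuple[float, float]] = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i  # an event run begins at this segment
        elif not flag and start is not None:
            boundaries.append((start * segment_seconds, i * segment_seconds))
            start = None
    if start is not None:
        # the event run extends to the end of the video
        boundaries.append((start * segment_seconds, len(flags) * segment_seconds))
    return boundaries


# Hypothetical six-segment video: the event is heard from segment 1 and seen from segment 2.
audible = [False, True, True, True, False, True]
visible = [False, False, True, True, True, True]
flags = audio_visual_segments(audible, visible)
print(event_boundaries(flags))
```

Under these assumptions, a segment where the event is only heard or only seen is excluded, so the boundaries cover just the spans where both modalities agree.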