Visual Commonsense Reasoning (VCR)

Large-scale dataset for cognition-level visual understanding

Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning: given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. To support this task, we introduce VCR, a new dataset of 290k multiple-choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach that transforms rich annotations into multiple-choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%).
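To make the Adversarial Matching idea concrete, here is a minimal, hedged sketch. It assumes a greedy simplification (the actual approach solves a maximum-weight bipartite matching with learned relevance and similarity models); `word_overlap` is a crude stand-in scorer invented for illustration, not part of the dataset's tooling. The intuition: a good distractor for a question is highly relevant to the question but dissimilar to its correct answer, so counterfactual answers are recycled from other questions in the pool.

```python
# Greedy sketch of Adversarial Matching (a simplification; the original
# formulates this as maximum-weight bipartite matching with learned models).
# Each question draws distractors from the other questions' correct answers,
# scored as: relevance(question, candidate) - lam * similarity(answer, candidate).

def word_overlap(x, y):
    """Crude Jaccard word overlap; a stand-in for a learned scoring model."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return len(xs & ys) / max(len(xs | ys), 1)

def adversarial_matching(questions, answers, n_distractors=3, lam=0.5):
    """For each question i, pick distractors from other questions' answers.

    A candidate scores high if it looks relevant to the question but does
    not look like a paraphrase of the correct answer answers[i].
    """
    all_choices = []
    for i, q in enumerate(questions):
        scored = []
        for j, cand in enumerate(answers):
            if j == i:
                continue  # never offer the question's own answer as a distractor
            score = word_overlap(q, cand) - lam * word_overlap(answers[i], cand)
            scored.append((score, cand))
        scored.sort(reverse=True)  # highest-scoring candidates first
        all_choices.append([cand for _, cand in scored[:n_distractors]])
    return all_choices
```

With four (question, answer) pairs, each question ends up with three distractors taken from the other answers, which is how VCR builds four-way multiple-choice problems without writing wrong answers by hand.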

University of Washington
Task: Visual Question Answering
Annotation Types: Semantic Segmentation
Items: 110,000
Labels: 110,000
Last updated on January 20, 2022
Licensed under Research Only