20+ Open Source Computer Vision Datasets

What is the best place to find computer vision datasets? Check out this list of 20+ curated image and video datasets and start annotating data and training your models today.
August 4, 2021

AI is driven by data—not code.

This bold statement could have sounded outlandish a few years back, but not anymore. However—

There is still one problem.

Quality training data can be really hard to access. It might take you days or weeks to find a suitable dataset for your computer vision tasks.

But worry not.

In this article, we've put together a comprehensive list of quality computer vision datasets that you can access for free.

Have a look.

COVID-19 X-Ray Dataset (V7)

This is V7’s original dataset, containing 6,500 AP/PA chest X-ray images with pixel-level polygonal lung segmentations. Of these, 517 are COVID-19 cases.

Each image contains:

  • Two "Lung" segmentation masks
  • A tag for the type of pneumonia (viral, bacterial, fungal, healthy/none)
  • If the patient has COVID-19, additional tags stating age, sex, temperature, location, intubation status, ICU admission, and patient outcome.

Lung annotations are polygons following pixel-level boundaries. You can export them in COCO, VOC, or Darwin JSON formats. Each annotation file contains a URL to the original full resolution image and a reduced size thumbnail.
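To make the polygon export more concrete, here is a minimal sketch of working with a COCO-style polygon. It assumes the segmentation is stored as a flat list of coordinates ([x1, y1, x2, y2, ...]), which is how COCO encodes polygons; the example polygon and the function names are hypothetical and are not taken from the dataset.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_points(flat: List[float]) -> List[Point]:
    """Turn a flat [x1, y1, x2, y2, ...] list into (x, y) pairs."""
    if len(flat) % 2 != 0 or len(flat) < 6:
        raise ValueError("expected at least three (x, y) pairs")
    return list(zip(flat[0::2], flat[1::2]))


def polygon_area(points: List[Point]) -> float:
    """Shoelace formula: area enclosed by the polygon, in square pixels."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical "Lung" polygon: a 10 x 20 pixel rectangle, for illustration only.
example_segmentation = [0, 0, 10, 0, 10, 20, 0, 20]
example_area = polygon_area(polygon_points(example_segmentation))
```

A check like this can help spot degenerate masks (for example, polygons with near-zero area) before training.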

For more details, check out: COVID-19 X-Ray dataset (GitHub)

CIFAR-10 & CIFAR-100

CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

CIFAR-10 contains 60,000 32x32 color images in 10 classes (animals and real-life objects), with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 test images. The classes are mutually exclusive, with no overlap between them.

CIFAR-100 consists of 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. 
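The split figures above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the `SplitSpec` class is a hypothetical helper, and the per-class CIFAR-10 numbers (5,000 training and 1,000 test images per class) are derived from the totals above on the assumption that the classes are balanced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitSpec:
    """Hypothetical helper describing a class-balanced train/test split."""
    classes: int
    train_per_class: int
    test_per_class: int

    @property
    def images_per_class(self) -> int:
        return self.train_per_class + self.test_per_class

    @property
    def train_total(self) -> int:
        return self.classes * self.train_per_class

    @property
    def test_total(self) -> int:
        return self.classes * self.test_per_class

    @property
    def total(self) -> int:
        return self.train_total + self.test_total


# Assumes balanced classes: 50,000 / 10 and 10,000 / 10 images per class.
cifar10 = SplitSpec(classes=10, train_per_class=5000, test_per_class=1000)
# Per-class figures as stated in the article.
cifar100 = SplitSpec(classes=100, train_per_class=500, test_per_class=100)
```

Under these assumptions, both datasets come out at 60,000 images in total, which matches the CIFAR-10 figure quoted above.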


ImageNet

ImageNet is one of the most popular image databases, with more than 14 million hand-annotated images.

This database is organized according to the WordNet hierarchy (currently only the nouns), in which hundreds and thousands of images depict each node of the hierarchy. Object-level annotations provide a bounding box around the (visible part of the) indicated object. 


Kinetics-700

Kinetics-700 is a large video dataset consisting of 650,000 clips covering 700 human action classes.

The videos include human-object interactions, like playing instruments, and human-human interactions, like hugging. Each action class has at least 700 video clips. Each clip lasts about 10 seconds and is annotated with a single action class.


MNIST

MNIST is a large database of handwritten single digits containing 60,000 training images and 10,000 testing images.

It was released in 1999 and is used for classification tasks.


LSUN

LSUN (Large-scale Scene Understanding) contains close to one million labeled images for each of 10 scene categories and 20 object categories.

For training data, each category contains between roughly 120,000 and 3,000,000 images. The validation data includes 300 images, and the test data has 1,000 images for each category.

💡 Pro tip: Check out The Train, Validation, and Test Sets: How to Split Your Machine Learning Data to learn more.


IMDB-WIKI

IMDB-WIKI is one of the largest publicly available datasets of human faces labeled with gender, age, and name.

It contains 523,051 images in total, with 460,723 face images from 20,284 celebrities from IMDb and 62,328 from Wikipedia.


MS COCO

The MS COCO (Microsoft Common Objects in Context) dataset consists of 328K images. It contains annotations for object detection, keypoint detection, panoptic segmentation, stuff image segmentation, captioning, and dense human pose estimation.

Labeled Faces in the Wild

It is a large-scale database of 13,000 face photographs designed for facial recognition tasks. Each face has been labeled with the person’s name.


Cityscapes

Cityscapes is a database containing a diverse set of stereo video sequences recorded in street scenes from 50 different cities. The images were captured over time in various lighting and weather conditions.

The Cityscapes dataset includes semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories. It provides 5,000 frames with fine pixel-level annotations and 20,000 coarsely annotated frames.


LabelMe-12-50k

This dataset contains 50,000 JPEG images (40,000 for training and 10,000 for testing) in 12 classes. The images are extracted from LabelMe.

Classes include objects such as a car, a person, a tree, or a keyboard. In both the training and testing sets, 50% of the images show a centered object, while the remaining 50% show a randomly selected region of a randomly selected image ("clutter").

This dataset can be used for object recognition.


Places

The Places dataset consists of 2.5 million labeled images across 205 scene categories, with more than 5,000 images per category. It is used to train CNNs for scene recognition tasks.

Places2 (365-Standard)

Places2 is another dataset contributed by MIT. It contains 1.8 million images from 365 scene categories, with 50 images per category in the validation set and 900 per category in the testing set. The Places2 database can be used for scene recognition and for learning generic deep scene features for visual recognition.


Visual Genome

Visual Genome is a large dataset and knowledge base of 108,077 images with annotated objects, attributes, and relationships between objects.

Stanford Dogs 

This dataset has been built using images and annotations (class labels, bounding boxes) from ImageNet. It is a large-scale dataset containing 20,580 images of 120 breeds of dogs from around the world, with one category per breed.

Stanford Cars 

This dataset contains 16,185 images in 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, with each class divided roughly 50-50 between the two sets.

You have to download the images and their class labels and bounding boxes separately.

Cat Dataset 

The CAT dataset includes over 9,000 cat images with annotated facial features. There are annotations of the cat’s head with nine points for each image: two for eyes, one for the mouth, and six for the ears.


CelebA

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200,000 celebrity images, each with 40 attribute annotations. The dataset covers 10,177 unique identities and provides five landmark locations per image.

The dataset can be used as training and test sets for face detection, face attribute recognition, and landmark (or facial part) localization.

Face Mask Detection

This dataset contains 853 images belonging to 3 classes, along with their bounding boxes in the PASCAL VOC format. The classes are “with mask”, “without mask”, and “mask worn incorrectly”.
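As a rough illustration of what reading PASCAL VOC bounding boxes involves, the sketch below parses a VOC-style XML annotation using only the Python standard library. The sample XML, the file name, and the label strings are made up for illustration; the actual label spellings in the dataset files may differ.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Box = Tuple[str, Tuple[int, int, int, int]]

# Hypothetical annotation in the PASCAL VOC layout (not copied from the dataset).
SAMPLE_XML = """<annotation>
  <filename>example.png</filename>
  <size><width>400</width><height>300</height><depth>3</depth></size>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>50</xmin><ymin>60</ymin><xmax>120</xmax><ymax>150</ymax></bndbox>
  </object>
  <object>
    <name>without_mask</name>
    <bndbox><xmin>200</xmin><ymin>80</ymin><xmax>260</xmax><ymax>160</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text: str) -> List[Box]:
    """Return (label, (xmin, ymin, xmax, ymax)) for every annotated object."""
    root = ET.fromstring(xml_text)
    boxes: List[Box] = []
    for obj in root.iter("object"):
        label = obj.findtext("name", default="")
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # skip objects that have no bounding box
        xmin, ymin, xmax, ymax = (
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, (xmin, ymin, xmax, ymax)))
    return boxes


example_boxes = parse_voc(SAMPLE_XML)
```

In practice, you would read each annotation file from disk and pass its contents to a function like `parse_voc`, then map the label strings to class indices.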

Fire and Smoke Dataset

It is a dataset with more than 7,000 unique images in HD resolution.

It consists of early fire and smoke images captured using mobile phones in real-world scenarios. The images were captured under a wide variety of lighting and weather conditions. This dataset can be used for fire and smoke recognition, detection, and anomaly detection.

It also contains various everyday scenes, including garbage burning, field crop burning, and domestic cooking.

FloodNet Dataset

This dataset consists of high-resolution UAS (unmanned aerial system) imagery with detailed semantic annotations of the damage caused by hurricanes.

The data was collected with a small UAS platform, DJI Mavic Pro quadcopters, after Hurricane Harvey. The whole dataset has 2,343 images, divided into training (~60%), validation (~20%), and test (~20%) sets.
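A split with these proportions can be sketched as follows. This is a generic, seeded shuffle-and-slice approach, not the official FloodNet split; the function name, the seed, and the rounding behavior are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def split_indices(
    n_items: int,
    train_frac: float = 0.6,
    val_frac: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices reproducibly and slice them into train/val/test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, so global state is untouched
    n_train = int(n_items * train_frac)   # fractions are rounded down
    n_val = int(n_items * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


# 2,343 is the image count quoted above; the resulting split is illustrative only.
train_ids, val_ids, test_ids = split_indices(2343)
```

Because the fractions are rounded down, the test set absorbs the leftover images, which is why the proportions are only approximately 60/20/20.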

P.S. The FloodNet dataset was annotated using V7.

Over to you: Next steps

Curious to learn more about labeling and training data?

Here are a few resources to get you started:

  1. What is Data Labeling and How to Do It Efficiently [Tutorial]
  2. Annotating With Bounding Boxes: Quality Best Practices
  3. Data Cleaning Checklist: How to Prepare Your Machine Learning Data
  4. 15+ Top Computer Vision Project Ideas for Beginners

And if you are ready to take action, check out:

  1. 13 Best Image Annotation Tools
  2. Data Annotation Tutorial: Definition, Tools, Datasets
  3. Automated Annotation with V7

Previously CEO at Aipoly - First smartphone engine for convolutional neural networks. Management & Stats grad at Cass Business School and Singularity University. Never had a real job.
