21+ Best Healthcare Datasets for Computer Vision

Jul 15, 2021

The use of computer vision in medical imaging analysis has a plethora of benefits. The downside? Limited access to data. Check out this list of open healthcare datasets to find data faster.

Alberto Rizzoli

Co-founder & CEO

Here’s some food for thought—

Image data accounts for about 90 percent of all healthcare input data.

It creates a multitude of opportunities for training computer vision algorithms to improve diagnostic accuracy, enhance care delivery, or automate medical records management.

However—

Medical data is often fragmented, messy, and hard to access. It might take you hours to find relevant datasets.

Hence, we’ve curated a list of open-source healthcare datasets that you can use for medical imaging annotation.

Here’s what we’ll cover:

General health and scientific research
COVID-19 datasets
CT datasets
MRI datasets
X-Ray datasets
Other healthcare datasets
Dataset aggregators

Pro tip: Looking for a tool to label your medical data? Check out Medical Image Annotation with V7.

P.S. We will regularly update this list, so feel free to suggest the datasets you are using and we will make sure to add them.

General health and scientific research

NLM's MedPix

A free online Medical Image Database with over 59,000 indexed and curated images from over 12,000 patients.

The Cancer Imaging Archive (TCIA)

TCIA is a service that de-identifies and hosts a large archive of medical images of cancer accessible for public download.

The data is organized as “collections”: typically patients’ imaging related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.), or research focus.

DICOM is the primary file format used by TCIA for radiology imaging.

There’s also supporting data related to the images, such as patient outcomes, treatment details, genomics, and expert analyses.
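Since TCIA radiology collections are distributed as DICOM, a quick sanity check after downloading is to confirm which files actually carry the DICOM signature. Here is a minimal sketch that relies only on the standard file layout (a 128-byte preamble followed by the bytes "DICM"). The function names are ours and are not part of any TCIA tooling.

```python
from pathlib import Path


def looks_like_dicom(path):
    """Return True if the file has the standard DICOM Part 10 signature."""
    with open(path, "rb") as f:
        header = f.read(132)
    # Part 10 files start with a 128-byte preamble followed by b"DICM".
    return len(header) == 132 and header[128:132] == b"DICM"


def find_dicom_files(root):
    """Recursively list files under root that carry the DICM signature."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and looks_like_dicom(p)
    )
```

Files written without the Part 10 preamble will not be detected by this check, so treat a negative result as a hint rather than proof.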

Re3Data

Re3Data is a global registry of research data repositories from different academic disciplines. It was launched in 2012 and funded by the German Research Foundation (DFG).

Re3Data contains data from over 2000 research subjects defined across several broad categories.

COVID-19 datasets

V7 COVID-19 X-Ray dataset

This COVID-19 X-Ray dataset contains 6500 images of AP/PA chest X-Rays with pixel-level polygonal lung segmentations. There are 517 cases of COVID-19 amongst these.

Each image contains:

Two "Lung" segmentation masks (rendered as polygons, including the posterior region behind the heart).
A tag for the type of pneumonia (viral, bacterial, fungal, healthy/none).
If the patient has COVID-19, additional tags stating age, sex, temperature, location, intubation status, ICU admission, and patient outcome.

Lung annotations are polygons following pixel-level boundaries. They can be exported as COCO, VOC, or Darwin JSON formats.
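To give a feel for working with polygon exports, here is a small sketch that computes the area and bounding box of a COCO-style polygon, where each polygon is a flat list [x1, y1, x2, y2, ...]. The annotation below is a made-up rectangle for illustration, not a record from the dataset, and the helper names are ours.

```python
def polygon_area(flat_xy):
    """Shoelace area of a polygon given as [x1, y1, x2, y2, ...]."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


def polygon_bbox(flat_xy):
    """Bounding box as [x_min, y_min, width, height], the COCO bbox order."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


# Hypothetical annotation: a 50 x 100 pixel rectangle standing in for a lung mask.
annotation = {"category_id": 1, "segmentation": [[10, 10, 60, 10, 60, 110, 10, 110]]}

lung_area = sum(polygon_area(p) for p in annotation["segmentation"])
lung_bbox = polygon_bbox(annotation["segmentation"][0])
print(lung_area, lung_bbox)  # 5000.0 [10, 10, 50, 100]
```

Comparing the areas of the two lung polygons per image is one simple way to spot annotation outliers before training.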

COVID-19 image dataset

The COVID-19 image dataset includes 137 cleaned COVID-19 images and 317 images in total, including Viral Pneumonia and Normal chest X-rays, organized into test and train directories.
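With a dataset split into train and test directories, it helps to check the class balance before training. The sketch below assumes a layout of root/<split>/<class>/<image>, which is a guess at the folder structure rather than something stated above, so adjust it to what you actually download.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Count images per (split, class), assuming root/<split>/<class>/<image>."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            parts = path.relative_to(root).parts
            # Only count files nested at least two folders deep (split, class).
            if len(parts) >= 3:
                counts[(parts[0], parts[1])] += 1
    return counts
```

Calling count_images on the dataset folder returns a Counter keyed by (split, class), which makes an uneven split easy to spot.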

COVID-19 CT scans

COVID-19 CT scans is a small dataset with 20 CT scans and expert segmentations of patients with COVID-19.

CT datasets

CT Medical Images

The CT Medical Images dataset is a small subset of images from The Cancer Imaging Archive (TCIA).

It consists of the middle slice of all CT images with age, modality, and contrast tags. This results in 475 series from 69 different patients.

Deep Lesion

Deep Lesion is one of the largest image sets currently available. It contains CT images released by the NIH to help improve the accuracy of lesion documentation and diagnosis. Deep Lesion includes over 32,000 lesions from over 4,000 unique patients.

Public Lung Database

The current Public Lung Database contains a limited number of annotated CT image scans highlighting many of the key issues in measuring large lesions in the lung.

All images are freely available for download.

VIA Group Public Databases

VIA Group Public Databases contains two public image datasets with lung CT images in the DICOM format together with documentation of abnormalities by radiologists.

MRI datasets

OASIS Brains Datasets

The Open Access Series of Imaging Studies (OASIS) aims to make MRI data sets of the brain freely available to the scientific community.

It provides access to a database of neuroimaging and processed imaging data across a broad demographic, cognitive, and genetic spectrum for use in neuroimaging, clinical, and cognitive research on normal aging and cognitive decline.

The database currently contains three separate datasets: OASIS-1, OASIS-2, and OASIS-3.

MRNet: Knee MRIs

The MRNet: Knee MRI dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center.

The dataset contains 1,104 abnormal exams, with 319 ACL tears and 508 meniscal tears. All the labels were obtained through manual extraction from clinical reports.
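These counts show that the labels are quite imbalanced: roughly 81 percent of exams are abnormal, while ACL tears appear in under a quarter of them. The sketch below turns the figures above into per-label prevalence and a negative-to-positive ratio, which is often used as a weight for the positive class. Treating each label as an independent binary task is our assumption here, and the dictionary keys are illustrative names.

```python
TOTAL_EXAMS = 1370
POSITIVES = {"abnormal": 1104, "acl_tear": 319, "meniscal_tear": 508}


def prevalence(positives, total):
    """Fraction of exams that carry the label."""
    return positives / total


def positive_weight(positives, total):
    """Ratio of negative to positive exams for one binary label."""
    return (total - positives) / positives


for label, n_pos in POSITIVES.items():
    print(
        f"{label}: prevalence={prevalence(n_pos, TOTAL_EXAMS):.3f}, "
        f"pos_weight={positive_weight(n_pos, TOTAL_EXAMS):.2f}"
    )
```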

IVDM3Seg

IVDM3Seg contains 24 3D multi-modality MRI data sets of at least 7 IVDs of the lower spine, collected from 12 subjects in two different stages in a study investigating the effect of prolonged bed rest (spaceflight simulation) on the lumbar intervertebral discs.

In total, there are 96 high-resolution 3D MRI volumes. For each IVD, reference manual segmentation is provided in the form of a binary mask. All images (four volumes per patient) and binary masks (one binary volume per patient) are stored in the Neuroimaging Informatics Technology Initiative (NIfTI) file format.
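If you want to inspect NIfTI volumes without installing an imaging library, the fixed 348-byte NIfTI-1 header can be read with the standard library alone. This is a minimal sketch that returns only the volume shape and voxel sizes. It assumes NIfTI-1 files (plain or gzipped) and the function name is ours; for real work a dedicated library is the safer choice.

```python
import gzip
import struct


def read_nifti1_shape(path):
    """Return (shape, voxel_sizes) from a NIfTI-1 header (.nii or .nii.gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        header = f.read(348)
    if len(header) < 348:
        raise ValueError("file too short for a NIfTI-1 header")
    # sizeof_hdr must equal 348; whichever byte order yields 348 is the file's.
    if struct.unpack("<i", header[0:4])[0] == 348:
        endian = "<"
    elif struct.unpack(">i", header[0:4])[0] == 348:
        endian = ">"
    else:
        raise ValueError("not a NIfTI-1 header")
    if header[344:347] not in (b"n+1", b"ni1"):
        raise ValueError("missing NIfTI-1 magic string")
    # dim[0] is the number of dimensions; dim[1:] holds the size of each one.
    dim = struct.unpack(endian + "8h", header[40:56])
    pixdim = struct.unpack(endian + "8f", header[76:108])
    ndim = dim[0]
    return dim[1:1 + ndim], pixdim[1:1 + ndim]
```

Checking that each image volume and its binary mask report the same shape is a cheap way to catch mismatched pairs before training a segmentation model.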

X-Ray datasets

NIH Database of 100,000 Chest X-Rays

The NIH dataset contains over 112,000 chest X-ray images from more than 30,000 unique patients.

ChestX-Det-Dataset

A chest X-ray dataset with instance-level annotations of 13 categories of disease/abnormality across 3,578 images.

The 13 categories are Atelectasis, Calcification, Cardiomegaly, Consolidation, Diffuse Nodule, Effusion, Emphysema, Fibrosis, Fracture, Mass, Nodule, Pleural Thickening, Pneumothorax.

CheXpert

CheXpert is a dataset consisting of 224,316 chest radiographs of 65,240 patients who underwent a radiographic examination at Stanford University Medical Center between October 2002 and July 2017.

It includes associated radiology reports.

SCR database: Segmentation in Chest Radiographs

The SCR database includes digital chest X-ray images with segmentations of lung fields, heart, and clavicles. All chest radiographs are taken from the JSRT database, a publicly available database with 247 PA chest radiographs collected from 13 institutions in Japan and one in the United States.

There are 154 images that contain exactly one pulmonary nodule each; the other 93 images contain no lung nodules.

MURA: MSK X-Rays

MURA is a dataset of musculoskeletal radiographs consisting of 14,863 studies from 12,173 patients, with a total of 40,561 multi-view radiographic images.

Each study belongs to one of seven standard upper extremity radiographic study types: elbow, finger, forearm, hand, humerus, shoulder, and wrist.

Pro tip: Check out 6 Innovative Artificial Intelligence Applications in Dentistry.

Other healthcare datasets

STARE

The STARE (Structured Analysis of the Retina) dataset is used for retinal vessel segmentation. The STARE Project was conceived and initiated in 1975 by Michael Goldbaum, M.D., at the University of California, San Diego, and funded by the U.S. National Institutes of Health.

It contains 20 equal-sized (700×605) color fundus images.

Dataset aggregators

OpenNEURO

OpenNEURO is a free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data. It currently offers 562 public datasets.

Kaggle

Kaggle is a data science community platform with tools and resources, including externally contributed machine learning datasets of all kinds. To find health-related datasets, search for the keyword or topic you are interested in.

UCI Machine Learning Repository

UCI Machine Learning Repository is one of the oldest dataset aggregators on the web. All datasets are user-contributed, and you can download them without registration. They are categorized by task, attribute, data type, and area of expertise.

Pro tip: Are you ready to start annotating your data? Check out this Data Annotation Guide and the list of the 13 Best Image Annotation Tools.

Conclusion

There's no doubt that computer vision is already revolutionizing the healthcare industry.

It's been successfully implemented across a wide spectrum of medical procedures, and the growing demand for automated data processing will only contribute to further advancements in the deep learning field.

Easy access to quality health data is the fundamental building block that will fuel innovation and transform the healthcare system in the years to come.

Alberto Rizzoli

Co-founder & CEO of V7

Previously CEO at Aipoly - First smartphone engine for convolutional neural networks. Management & Stats grad at Cass Business School and Singularity University. Never had a real job.
