21+ Best Healthcare Datasets for Computer Vision

Computer vision offers many benefits for medical imaging analysis. The downside? Limited access to data. Check out this list of open healthcare datasets to find data faster.

Here’s some food for thought—

Image data accounts for about 90 percent of all healthcare input data.

It creates a multitude of opportunities for training computer vision algorithms to improve diagnostic accuracy, enhance care delivery, or automate medical records management.

However— 

Medical data is often fragmented, messy, and hard to access. It might take you hours to find relevant datasets.

Hence, we’ve curated a list of open-source healthcare datasets that you can use for medical imaging annotation.

  1. General health and scientific research
  2. COVID-19 datasets
  3. CT datasets
  4. MRI datasets
  5. X-Ray datasets
  6. Other healthcare datasets
  7. Dataset aggregators
💡 Pro tip: Looking for a tool to label your medical data? Check out Medical Image Annotation with V7.

P.S. We will regularly update this list, so feel free to suggest the datasets you are using and we will make sure to add them.

General health and scientific research

NLM's MedPix 

A free online Medical Image Database with over 59,000 indexed and curated images from over 12,000 patients.

The Cancer Imaging Archive (TCIA)

TCIA is a service that de-identifies and hosts a large archive of medical images of cancer accessible for public download. 

The data is organized as “collections”: typically patients’ imaging related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.), or research focus.

DICOM is the primary file format used by TCIA for radiology imaging. 
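Because TCIA downloads often arrive as deep folder trees with inconsistent file extensions, a quick sanity check can help before loading anything. The sketch below relies on one fact from the DICOM file format: a standard file starts with a 128-byte preamble followed by the four bytes "DICM". The function names are made up for illustration, and some legacy files omit the preamble, so treat a negative result as a hint rather than proof.

```python
from pathlib import Path

DICOM_PREAMBLE_LEN = 128   # bytes reserved before the marker
DICOM_MAGIC = b"DICM"      # 4-byte marker that follows the preamble


def is_dicom_file(path):
    """Return True if the file carries the 'DICM' marker at byte offset 128."""
    with open(path, "rb") as f:
        header = f.read(DICOM_PREAMBLE_LEN + len(DICOM_MAGIC))
    return header[DICOM_PREAMBLE_LEN:] == DICOM_MAGIC


def find_dicom_files(root):
    """Recursively list files under `root` that look like DICOM files."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_dicom_file(p)
    )
```

For reading pixel data and metadata you would still reach for a dedicated DICOM library; this check only filters candidate files.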

There’s also supporting data related to the images, such as patient outcomes, treatment details, genomics, and expert analyses.

Re3Data

Re3Data is a global registry of research data repositories covering different academic disciplines. It was launched in 2012 and funded by the German Research Foundation (DFG).

Re3Data contains data from over 2000 research subjects defined across several broad categories. 

COVID-19 datasets

V7 COVID-19 X-ray dataset

This dataset contains 6500 images of AP/PA chest X-rays with pixel-level polygonal lung segmentations. There are 517 cases of COVID-19 among them.

Each image contains:

  • Two "Lung" segmentation masks (rendered as polygons, including the posterior region behind the heart).
  • A tag for the type of pneumonia (viral, bacterial, fungal, healthy/none).
  • If the patient has COVID-19, additional tags stating age, sex, temperature, location, intubation status, ICU admission, and patient outcome.

Lung annotations are polygons following pixel-level boundaries. These can be exported in COCO, VOC, or Darwin JSON format. Each annotation file contains a URL to the original full-resolution image, as well as a reduced-size thumbnail.
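To give a feel for working with polygon annotations, here is a minimal sketch that derives an area and a bounding box from one polygon. It assumes the COCO convention of storing each polygon as a flat [x1, y1, x2, y2, ...] list; the annotation below is made up, and the exact fields in a real export may differ.

```python
def polygon_points(flat):
    """Turn a flat [x1, y1, x2, y2, ...] list into [(x1, y1), (x2, y2), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


def polygon_area(points):
    """Area enclosed by a simple polygon, via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_bbox(points):
    """Axis-aligned bounding box as (x_min, y_min, width, height)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


# Hypothetical annotation: a 40 x 30 rectangle standing in for a lung outline.
annotation = {"category_id": 1, "segmentation": [[10, 20, 50, 20, 50, 50, 10, 50]]}
pts = polygon_points(annotation["segmentation"][0])
print(polygon_area(pts))   # 1200.0
print(polygon_bbox(pts))   # (10, 20, 40, 30)
```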

COVID-19 image dataset

This dataset has 137 cleaned COVID-19 images and 317 images in total, including Viral Pneumonia and Normal chest X-rays, organized into train and test directories.
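With a train/test folder layout like this, it is worth counting images per class before training. The sketch below assumes a root/<split>/<class>/<image> structure and common image extensions; both are assumptions about the download, so adjust them to what you actually find on disk.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed extensions


def count_images(root):
    """Count images per (split, class), assuming root/<split>/<class>/<image>."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            parts = path.relative_to(root).parts
            if len(parts) >= 3:          # split / class / file name
                counts[(parts[0], parts[1])] += 1
    return counts
```

Calling count_images on the dataset folder returns a Counter keyed by (split, class), which makes class imbalance between the two splits easy to spot.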

COVID-19 CT scans

A small dataset with 20 CT scans of patients with COVID-19, along with expert segmentations.

CT datasets

CT Medical Images

This dataset is a small subset of images from The Cancer Imaging Archive (TCIA).

It consists of the middle slice of all CT images with age, modality, and contrast tags. This results in 475 series from 69 different patients.

Deep Lesion

It is one of the largest image sets currently available. It contains CT images released by the NIH to help improve the accuracy of lesion documentation and diagnosis. Deep Lesion includes over 32,000 lesions from 4000 unique patients.

Public Lung Database 

The current database contains a limited number of annotated CT image scans highlighting many of the key issues in measuring large lesions in the lung. 

All images are freely available for download.

VIA Group Public Databases

It contains two public image datasets with lung CT images in the DICOM format together with documentation of abnormalities by radiologists. 

MRI datasets

OASIS Brains Datasets

The Open Access Series of Imaging Studies (OASIS) aims to make MRI data sets of the brain freely available to the scientific community. 

It provides access to a database of neuroimaging and processed imaging data across a broad demographic, cognitive, and genetic spectrum for use in neuroimaging, clinical, and cognitive research on normal aging and cognitive decline.

The database currently contains three separate datasets: OASIS-1, OASIS-2, and OASIS-3.

MRNet: Knee MRIs

The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center. 

The dataset contains 1,104 abnormal exams, with 319 ACL tears and 508 meniscal tears. All the labels were obtained through manual extraction from clinical reports.
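The counts above imply a noticeable class imbalance, which matters when choosing loss weights or evaluation metrics. As a worked example, this sketch computes the prevalence of each label and its negative-to-positive ratio from the published numbers. Each label is treated independently here, and the dictionary keys are made-up names.

```python
TOTAL_EXAMS = 1370
POSITIVES = {"abnormal": 1104, "acl_tear": 319, "meniscal_tear": 508}


def label_stats(total, positives):
    """Per-label prevalence and negative-to-positive ratio."""
    stats = {}
    for label, pos in positives.items():
        neg = total - pos
        stats[label] = {"prevalence": pos / total, "neg_to_pos": neg / pos}
    return stats


for label, s in label_stats(TOTAL_EXAMS, POSITIVES).items():
    print(f"{label}: prevalence={s['prevalence']:.1%}, neg/pos={s['neg_to_pos']:.2f}")
```

Roughly 81 percent of exams are abnormal, while ACL tears appear in about 23 percent and meniscal tears in about 37 percent, so plain accuracy can be misleading for the rarer labels.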

IVDM3Seg 

It contains 24 3D multi-modality MRI data sets of at least 7 IVDs of the lower spine, collected from 12 subjects in two different stages in a study investigating the effect of prolonged bed rest (spaceflight simulation) on the lumbar intervertebral discs.

In total, there are 96 high-resolution 3D MRI volumes. For each IVD, a reference manual segmentation is provided in the form of a binary mask. All images (four volumes per patient) and binary masks (one binary volume per patient) are stored in the Neuroimaging Informatics Technology Initiative (NIfTI) file format.
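Before pairing images with masks, it helps to confirm that their shapes and voxel sizes match. The sketch below reads those fields straight from a NIfTI-1 header using only the standard library. It assumes NIfTI-1 files (a 348-byte header, optionally gzip-compressed); whether this dataset ships NIfTI-1 or another variant is not stated above, and in practice a dedicated NIfTI library is the more convenient route.

```python
import gzip
import struct

NIFTI1_HEADER_SIZE = 348  # fixed header length in NIfTI-1


def read_nifti1_header(raw):
    """Parse shape, voxel size and datatype from raw NIfTI-1 header bytes."""
    if len(raw) < NIFTI1_HEADER_SIZE:
        raise ValueError("need at least 348 header bytes")
    # The first int32 (sizeof_hdr) must equal 348; use it to detect byte order.
    for endian in ("<", ">"):
        if struct.unpack_from(endian + "i", raw, 0)[0] == NIFTI1_HEADER_SIZE:
            break
    else:
        raise ValueError("not a NIfTI-1 header")
    dim = struct.unpack_from(endian + "8h", raw, 40)       # dim[0] = number of axes
    datatype, bitpix = struct.unpack_from(endian + "2h", raw, 70)
    pixdim = struct.unpack_from(endian + "8f", raw, 76)    # voxel sizes from index 1
    ndim = dim[0]
    return {
        "shape": dim[1:1 + ndim],
        "voxel_size": pixdim[1:1 + ndim],
        "datatype": datatype,
        "bitpix": bitpix,
        "magic": raw[344:348],
    }


def read_nifti1_file(path):
    """Read the header from a .nii or .nii.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        return read_nifti1_header(f.read(NIFTI1_HEADER_SIZE))
```

Comparing the "shape" entry of an image volume with that of its mask is a cheap way to catch mismatched pairs early.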

X-Ray datasets

NIH Database of 100,000 Chest X-Rays

This dataset contains over 112,000 chest X-ray images from more than 30,000 unique patients.

ChestX-Det-Dataset

A chest X-ray dataset of 3,578 images with instance-level annotations covering 13 categories of disease/abnormality.

The 13 categories are Atelectasis, Calcification, Cardiomegaly, Consolidation, Diffuse Nodule, Effusion, Emphysema, Fibrosis, Fracture, Mass, Nodule, Pleural Thickening, Pneumothorax.

CheXpert

CheXpert is a dataset consisting of 224,316 chest radiographs of 65,240 patients who underwent a radiographic examination at Stanford University Medical Center between October 2002 and July 2017.

It includes associated radiology reports.
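With about 3.4 radiographs per patient on average (224,316 / 65,240), a random image-level split can place the same patient in both training and validation data. One common way around this is to split by patient instead. The sketch below is one possible approach, not something prescribed by the dataset: it hashes a patient ID to pick a split, and the IDs and paths in the example are hypothetical.

```python
import hashlib


def assign_split(patient_id, val_fraction=0.1):
    """Deterministically map a patient ID to 'train' or 'val'."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "val" if bucket < val_fraction else "train"


def split_by_patient(records, val_fraction=0.1):
    """Split (patient_id, image_path) pairs so no patient spans both splits."""
    splits = {"train": [], "val": []}
    for patient_id, image_path in records:
        splits[assign_split(patient_id, val_fraction)].append((patient_id, image_path))
    return splits


# Hypothetical records: two images for one patient, one for another.
records = [
    ("patient00001", "study1/view1_frontal.jpg"),
    ("patient00001", "study2/view1_frontal.jpg"),
    ("patient00002", "study1/view1_frontal.jpg"),
]
splits = split_by_patient(records, val_fraction=0.2)
```

Because the assignment depends only on the patient ID, the split stays stable when new images for an existing patient are added later.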

SCR database: Segmentation in Chest Radiographs

This database includes digital chest X-ray images with segmentations of lung fields, heart, and clavicles. All chest radiographs are taken from the JSRT database, a publicly available database with 247 PA chest radiographs collected from 13 institutions in Japan and one in the United States.

There are 154 images that contain exactly one pulmonary lung nodule each; the other 93 images contain no lung nodules.

MURA: MSK X-rays

MURA is a dataset of musculoskeletal radiographs consisting of 14,863 studies from 12,173 patients, with a total of 40,561 multi-view radiographic images. 

Each study belongs to one of seven standard upper extremity radiographic study types: elbow, finger, forearm, hand, humerus, shoulder, and wrist.

Other healthcare datasets

STARE

STARE (Structured Analysis of the Retina) is a dataset for retinal vessel segmentation. The STARE Project was conceived and initiated in 1975 by Michael Goldbaum, M.D., at the University of California, San Diego, and funded by the U.S. National Institutes of Health.

It contains 20 equal-sized (700×605) color fundus images.

Dataset aggregators

OpenNEURO

A free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data. It currently offers 562 public datasets.

Kaggle

A data science community platform with tools and resources, including externally contributed machine learning datasets of all kinds. To find health-related datasets, use the search bar with a keyword or topic you are interested in.

UCI Machine Learning Repository

One of the oldest dataset aggregators on the web. All datasets are user-contributed, and you can download them without registration. They are categorized by task, attribute, data type, and area of expertise.

💡 Pro tip: Are you ready to start annotating your data? Check out 13 Best Image Annotation Tools of 2021.

Conclusion

There's no doubt that computer vision is already revolutionizing the healthcare industry.

It's been successfully implemented across a wide spectrum of medical procedures, and the growing demand for automated data processing will only contribute to further advancements in the deep learning field.

Easy access to quality health data is thus the fundamental building block that will fuel innovation and transform the healthcare system in the years to come.

Alberto Rizzoli
V7

Alberto Rizzoli is the Co-Founder and CEO of V7. He is a firm believer that any task is learnable given the right training data in good quantities, and a simple architecture. He says that our closest reference to deep learning is the human sense of smell.
