Here’s some food for thought—
Image data accounts for about 90 percent of all healthcare input data.
It creates a multitude of opportunities for training computer vision algorithms to improve diagnostic accuracy, enhance care delivery, or automate medical records management.
However, medical data is often fragmented, messy, and hard to access. It might take you hours to find relevant datasets.
Hence, we’ve curated a list of open-source healthcare datasets that you can use for medical imaging annotation.
Here’s what we’ll cover:
P.S. We will regularly update this list, so feel free to suggest the datasets you are using and we will make sure to add them.
A free online Medical Image Database with over 59,000 indexed and curated images from over 12,000 patients.
TCIA is a service that de-identifies and hosts a large archive of medical images of cancer accessible for public download.
The data is organized into “collections”: typically patients’ imaging related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.), or research focus.
DICOM is the primary file format used by TCIA for radiology imaging.
There’s also supporting data related to the images, such as patient outcomes, treatment details, genomics, and expert analyses.
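Before feeding downloaded files into an annotation or training pipeline, it can help to confirm that they really are DICOM files. The sketch below assumes the files follow the standard DICOM Part 10 layout, where a 128-byte preamble is followed by the four bytes `DICM`. The function name `looks_like_dicom` is hypothetical, and the byte strings are synthetic stand-ins rather than real TCIA data.

```python
import io

DICOM_PREAMBLE_LENGTH = 128   # bytes reserved before the magic marker
DICOM_MAGIC = b"DICM"         # marker defined by the DICOM Part 10 file format


def looks_like_dicom(stream):
    """Return True if the binary stream starts with a DICOM Part 10 header."""
    header = stream.read(DICOM_PREAMBLE_LENGTH + len(DICOM_MAGIC))
    if len(header) < DICOM_PREAMBLE_LENGTH + len(DICOM_MAGIC):
        return False  # too short to contain the preamble and marker
    return header[DICOM_PREAMBLE_LENGTH:] == DICOM_MAGIC


# Synthetic examples: a fake DICOM header and a PNG-like header.
fake_dicom = io.BytesIO(b"\x00" * 128 + b"DICM" + b"payload")
not_dicom = io.BytesIO(b"\x89PNG\r\n\x1a\n" + b"\x00" * 200)

print(looks_like_dicom(fake_dicom))
print(looks_like_dicom(not_dicom))
```

For files on disk, the same function can be called on `open(path, "rb")`. This only checks the file signature; reading pixel data and metadata is usually left to a dedicated library such as pydicom.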
Re3data is a global registry of research data repositories from different academic disciplines. It was launched in 2012 and is funded by the German Research Foundation (DFG).
Re3data contains data from over 2,000 research subjects defined across several broad categories.
This dataset contains 6,500 images of AP/PA chest X-rays with pixel-level polygonal lung segmentations. There are 517 cases of COVID-19 among them.
Each image contains:
Lung annotations are polygons following pixel-level boundaries. They can be exported in COCO, VOC, or Darwin JSON format.
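As a rough illustration of working with polygon annotations, here is a minimal sketch that reads a COCO-style export and computes the total annotated lung area per image with the shoelace formula. It assumes the export follows the common COCO convention of storing each polygon as a flat `[x1, y1, x2, y2, ...]` list under `segmentation`. The JSON below is a made-up example, not data from the dataset, and the helper names are hypothetical.

```python
import json


def polygon_area(flat_points):
    """Area of a simple polygon given as [x1, y1, x2, y2, ...] (shoelace formula)."""
    xs = flat_points[0::2]
    ys = flat_points[1::2]
    n = len(xs)
    twice_area = 0.0
    for i in range(n):
        j = (i + 1) % n  # next vertex, wrapping around to close the polygon
        twice_area += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(twice_area) / 2.0


def lung_areas(coco):
    """Sum the polygon areas of all annotations, grouped by image id."""
    areas = {}
    for ann in coco["annotations"]:
        total = sum(polygon_area(poly) for poly in ann["segmentation"])
        areas[ann["image_id"]] = areas.get(ann["image_id"], 0.0) + total
    return areas


# Made-up COCO-style export with two rectangular "lung" polygons on one image.
example = json.loads("""
{
  "images": [{"id": 1, "file_name": "example.png", "width": 100, "height": 100}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,
     "segmentation": [[10, 10, 40, 10, 40, 50, 10, 50]]},
    {"id": 11, "image_id": 1, "category_id": 1,
     "segmentation": [[60, 10, 90, 10, 90, 50, 60, 50]]}
  ],
  "categories": [{"id": 1, "name": "lung"}]
}
""")

print(lung_areas(example))  # {1: 2400.0}
```

Each rectangle is 30 by 40 pixels, so the two polygons add up to 2,400 square pixels. A quick check like this can flag empty or implausibly small masks before training.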
This dataset contains 137 cleaned COVID-19 images and 317 images in total, including viral pneumonia and normal chest X-rays, organized into test and train directories.
It is a small dataset with 20 CT scans and expert segmentations of patients with COVID-19.
This dataset is a small subset of images from The Cancer Imaging Archive (TCIA).
It consists of the middle slice of all CT images with age, modality, and contrast tags. This results in 475 series from 69 different patients.
It is one of the largest image sets currently available. It contains CT images released by the NIH to help improve the accuracy of lesion documentation and diagnosis. DeepLesion includes over 32,000 lesions from 4,000 unique patients.
The current database contains a limited number of annotated CT image scans highlighting many of the key issues in measuring large lesions in the lung.
All images are freely available for download.
It contains two public image datasets with lung CT images in the DICOM format together with documentation of abnormalities by radiologists.
The Open Access Series of Imaging Studies (OASIS) aims to make MRI data sets of the brain freely available to the scientific community.
It provides access to a database of neuroimaging and processed imaging data across a broad demographic, cognitive, and genetic spectrum for use in neuroimaging, clinical, and cognitive research on normal aging and cognitive decline.
The database currently contains three separate datasets: OASIS-1, OASIS-2, and OASIS-3.
The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center.
The dataset contains 1,104 abnormal exams, with 319 ACL tears and 508 meniscal tears. All the labels were obtained through manual extraction from clinical reports.
It contains 24 3D multi-modality MRI data sets of at least 7 intervertebral discs (IVDs) of the lower spine, collected from 12 subjects in two different stages in a study investigating the effect of prolonged bed rest (spaceflight simulation) on the lumbar intervertebral discs.
In total, there are 96 high-resolution 3D MRI volumes. For each IVD, reference manual segmentation is provided in the form of a binary mask. All images (four volumes per patient) and binary masks (one binary volume per patient) are stored in the Neuroimaging Informatics Technology Initiative (NIfTI) file format.
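To get a feel for what a NIfTI volume contains before loading it into a full pipeline, the sketch below reads the volume shape and voxel size from a NIfTI-1 header using only the standard library. It assumes the 348-byte NIfTI-1 header layout (dimensions at byte offset 40, data type at 70, voxel sizes at 76, magic string at 344). The header here is synthetic and built in memory, the function name `parse_nifti1_header` is hypothetical, and whether this dataset ships uncompressed `.nii` or gzipped `.nii.gz` files is not stated above.

```python
import struct

NIFTI1_HEADER_SIZE = 348  # fixed header length defined by the NIfTI-1 format


def parse_nifti1_header(raw):
    """Extract shape, voxel size, and data type from raw NIfTI-1 header bytes."""
    if len(raw) < NIFTI1_HEADER_SIZE:
        raise ValueError("header is shorter than 348 bytes")
    # The first field stores the header size; use it to detect byte order.
    for endian in ("<", ">"):
        (sizeof_hdr,) = struct.unpack_from(endian + "i", raw, 0)
        if sizeof_hdr == NIFTI1_HEADER_SIZE:
            break
    else:
        raise ValueError("not a NIfTI-1 header")
    if raw[344:348] not in (b"n+1\x00", b"ni1\x00"):
        raise ValueError("missing NIfTI-1 magic string")
    dim = struct.unpack_from(endian + "8h", raw, 40)      # dim[0] is the number of axes
    datatype, bitpix = struct.unpack_from(endian + "2h", raw, 70)
    pixdim = struct.unpack_from(endian + "8f", raw, 76)   # pixdim[1:] are voxel sizes
    ndim = dim[0]
    return {
        "shape": dim[1:1 + ndim],
        "voxel_size": pixdim[1:1 + ndim],
        "datatype": datatype,
        "bitpix": bitpix,
    }


# Build a synthetic header for a 64 x 64 x 32 volume of 8-bit values.
header = bytearray(NIFTI1_HEADER_SIZE)
struct.pack_into("<i", header, 0, NIFTI1_HEADER_SIZE)
struct.pack_into("<8h", header, 40, 3, 64, 64, 32, 1, 1, 1, 1)
struct.pack_into("<2h", header, 70, 2, 8)  # datatype code 2 = unsigned 8-bit
struct.pack_into("<8f", header, 76, 1.0, 1.25, 1.25, 2.0, 0.0, 0.0, 0.0, 0.0)
header[344:348] = b"n+1\x00"

info = parse_nifti1_header(bytes(header))
print(info["shape"], info["voxel_size"])
```

For a file on disk, the first 348 bytes can be read with `open(path, "rb")`, or with `gzip.open(path, "rb")` for a `.nii.gz` file. For actual voxel data, a dedicated library such as nibabel is the more practical choice.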
This dataset contains over 112,000 chest X-ray images from more than 30,000 unique patients.
A chest X-ray dataset with instance-level annotations of 13 categories of disease/abnormality across 3,578 images.
The 13 categories are Atelectasis, Calcification, Cardiomegaly, Consolidation, Diffuse Nodule, Effusion, Emphysema, Fibrosis, Fracture, Mass, Nodule, Pleural Thickening, and Pneumothorax.
CheXpert is a dataset consisting of 224,316 chest radiographs of 65,240 patients who underwent a radiographic examination at Stanford University Medical Center between October 2002 and July 2017.
It includes associated radiology reports.
This database includes digital chest X-ray images with segmentations of lung fields, heart, and clavicles. All chest radiographs are taken from the JSRT database, a publicly available database with 247 PA chest radiographs collected from 13 institutions in Japan and one in the United States.
There are 154 images that contain exactly one pulmonary nodule each; the other 93 images contain no lung nodules.
MURA is a dataset of musculoskeletal radiographs consisting of 14,863 studies from 12,173 patients, with a total of 40,561 multi-view radiographic images.
Each study belongs to one of seven standard upper extremity radiographic study types: elbow, finger, forearm, hand, humerus, shoulder, and wrist.
The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. The STARE Project was conceived and initiated in 1975 by Michael Goldbaum, M.D., at the University of California, San Diego, and funded by the U.S. National Institutes of Health.
It contains 20 equal-sized (700×605) color fundus images.
A free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data. It currently offers 562 public datasets.
A data science community platform with tools and resources, including externally contributed machine learning datasets of all kinds. To find health-related datasets, search for the keyword or topic you are interested in using the search bar.
One of the oldest dataset aggregators on the web. All datasets are user-contributed, and you can download them without registration. They are categorized by task, attribute, data type, and area of expertise.
It's been successfully implemented across a wide spectrum of medical procedures, and the growing demand for automated data processing will only contribute to further advancements in the deep learning field.
Easy access to quality health data is the fundamental building block that will fuel innovation and transform the healthcare system in the years to come.
“Collecting user feedback and using human-in-the-loop methods for quality control are crucial for improving AI models over time and ensuring their reliability and safety. Capturing data on the inputs, outputs, user actions, and corrections can help filter and refine the dataset for fine-tuning and developing secure ML solutions.”