The Ultimate Guide to Medical Image Annotation

How to prepare and annotate medical image data? Check out this compilation of the best practices shared by leading ML teams labeling their medical data on V7.
January 31, 2022
Hand X-Ray annotation using keypoints

There’s no doubt that machine learning has the power to transform the healthcare industry.

The potential applications are wide-ranging and include the entirety of the medical imaging life cycle—from image creation and analysis to diagnosis and outcome prediction.

However, medical professionals are dealing with a multitude of obstacles that are preventing them from successfully implementing AI technology for clinical practice.

In this article, we will address those obstacles.

Here’s what we’ll cover:

  1. What is medical image annotation?
  2. Medical image data preparation
  3. Medical imaging annotation vs. regular data annotation
  4. HIPAA Compliance
  5. Choosing the best medical image annotation tool

Let’s get started.

And if you landed here looking to roll up your sleeves and get some hands-on experience annotating medical data, look no further!

You can find a suitable dataset in our Open Datasets repository and sign up for a 14-day free trial to start annotating on V7.

Have a look at this quick tutorial on labeling MRI and CT images.

What is medical image annotation?

Medical image annotation is the process of labeling medical imaging data such as X-ray, CT, and MRI scans, mammograms, or ultrasound images.

It is used to train AI algorithms for medical image analysis and diagnostics, helping doctors save time, make better-informed decisions, and improve patient outcomes.

However, as you’ll soon learn, the process is not as easy as it seems. 

💡 Pro tip: Curious to learn more? Check out 7 Life-Saving AI Use Cases in Healthcare.

Medical image data preparation

Limited access to medical image data is a substantial problem that explains current limitations related to the development of robust machine learning models.

Small sample sizes from small geographic areas and the time-consuming (and costly) process of data preparation create bottlenecks that result in algorithms with limited utility.

Here are a few things to keep in mind when preparing data for medical imaging annotation.

Dataset type

Your dataset needs to be representative of the environment in which the model will be deployed. This helps the model stay accurate once it is in use.

Using images from multiple diverse datasets (e.g., different imaging machines, different populations, and medical centers) is ideal for lowering the risk of bias. Most commonly, the ratio of the training, validation, and testing data is close to 80:10:10. 

Training, validation, test data

After collecting your data and training your model, it’s time to use the validation set to check for overfitting or underfitting, and adjust parameters accordingly.

Finally, the model’s performance is evaluated against a testing set.
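
The 80:10:10 split described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes each image comes with a patient identifier, and it splits by patient rather than by image (an assumption added here, as a common precaution so that images of one patient do not end up in more than one subset). The function name and the record layout are hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(records, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split (patient_id, image_path) records into train/val/test sets.

    Splitting by patient rather than by image keeps every image of a
    given patient in a single subset.
    """
    by_patient = defaultdict(list)
    for patient_id, image_path in records:
        by_patient[patient_id].append(image_path)

    # Sort first so the shuffle is reproducible for a given seed.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_train = int(len(patient_ids) * ratios[0])
    n_val = int(len(patient_ids) * ratios[1])
    groups = {
        "train": patient_ids[:n_train],
        "val": patient_ids[n_train:n_train + n_val],
        "test": patient_ids[n_train + n_val:],  # whatever remains
    }
    return {
        name: [path for pid in ids for path in by_patient[pid]]
        for name, ids in groups.items()
    }


# Hypothetical records: 100 patients with 3 images each.
records = [(f"patient_{p:03d}", f"patient_{p:03d}/img_{i}.dcm")
           for p in range(100) for i in range(3)]
splits = split_by_patient(records)
print({name: len(paths) for name, paths in splits.items()})
```

Fixing the seed keeps the split reproducible, which helps when the test set has to stay untouched across experiments.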

Acquiring a quality testing dataset is critical because it functions as the reference standard, and results on it will inform whether your trained model can proceed toward regulatory approval.

💡 Pro tip: Check out our Data Annotation Tutorial: Definition, Tools, Datasets.

Dataset size

Although you can still train a relatively reliable model for specific targeted applications using smaller datasets, it is better to collect large sample sizes.

In general, the larger and more diverse your dataset is, the more accurate your model is likely to be.

Large, relevant datasets are especially important when the differences between imaging phenotypes are subtle, or when you collect data on populations with substantial heterogeneity.

To develop generalizable ML algorithms in medical imaging, you need statistically powered datasets, potentially with millions of images.

Dataset format

Most medical imaging will be in DICOM format. 

What is DICOM? 

DICOM (Digital Imaging and Communications in Medicine) is the standard for the communication and management of medical imaging information and related data. A DICOM file represents a case that may contain one or more images.

From a machine learning perspective, the DICOM file will be converted to another lossless image format during training; therefore, using DICOM files for AI research is not a necessity.

However, preserving the DICOM image’s integrity can be helpful in the data labeling phase, particularly as radiologists are familiar with how DICOM viewers work after operating them for years.
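
When DICOM pixel data is converted for training, the stored integers are usually mapped to physical units first. The sketch below assumes a CT slice whose header supplies the standard RescaleSlope and RescaleIntercept attributes. The function name and the pixel values are made up for illustration, and reading the file itself (for example with a DICOM library) is left out.

```python
def to_hounsfield(stored_values, rescale_slope, rescale_intercept):
    """Apply the DICOM linear rescale: output = stored * slope + intercept.

    stored_values is a 2D list of raw pixel integers from one CT slice.
    For CT, the rescaled output is normally expressed in Hounsfield units.
    """
    return [
        [value * rescale_slope + rescale_intercept for value in row]
        for row in stored_values
    ]


# Made-up 2x2 slice with a slope and intercept chosen for illustration.
raw_slice = [[0, 1024], [2048, 3072]]
hu_slice = to_hounsfield(raw_slice, rescale_slope=1.0, rescale_intercept=-1024.0)
print(hu_slice)
```

No windowing or 8-bit conversion is applied here, so the full dynamic range of the scan is preserved.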

Multi-layer TIF files are also used. These contain slices of an image, often from microscopy, and are notoriously poorly supported. As with other TIF files, the acronym is jokingly expanded as “Thousands of Incompatible Formats.” While V7 supports most TIF files, we routinely encounter new variants that need support added.

Finally, some research will use ultra-high-resolution images that require tiling, such as Leica or Aperio's SVS.

These are often used in pathology. While many viewers support these high-resolution images, which may exceed 100,000 pixels squared and several gigabytes, very few allow you to add any markups or annotations on them that deep learning frameworks can read.
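
To give a sense of what tiling involves, here is a minimal sketch that computes the tile boxes needed to cover a very large image. The tile size of 1024 pixels is an assumption chosen for illustration, and the function is hypothetical. It only computes coordinates and does not read SVS or any other whole-slide format.

```python
def tile_grid(width, height, tile_size=1024, overlap=0):
    """Yield (x, y, w, h) boxes that together cover a width x height image.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must be in [0, tile_size)")
    step = tile_size - overlap
    for y in range(0, height, step):
        for x in range(0, width, step):
            yield (x, y, min(tile_size, width - x), min(tile_size, height - y))


# A 100,000 x 100,000 pixel slide, as mentioned above.
tiles = list(tile_grid(100_000, 100_000))
print(len(tiles), tiles[-1])
```

Annotations drawn on a tile can be mapped back to slide coordinates by adding the tile's `x` and `y` offsets.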

💡 Pro tip: Read The Essential Guide to Neural Network Architectures.

Medical image annotation vs. regular data annotation

If your ultimate goal is to train machine learning models, there are a few differences between annotating a medical image versus a regular PNG or JPEG. 

Here are a few things to consider about medical imaging that do not apply to other vision data.

Medical image annotation VS Regular data annotation

Let’s explore some of them in more detail.

Limited access to medical image data

“Garbage in, garbage out” is a popular machine learning quip that highlights how important the quality of the data is when training ML models.

You need quality data to build clinically meaningful models.


Access to medical imaging data through the picture archiving and communication system (PACS) is restricted to accredited medical professionals, and obtaining all legal documents and permissions is very time-consuming.

Additionally, most healthcare institutions don’t have the proper infrastructure to share large amounts of medical images.

Finally, collected data often requires anonymization (de-identification), which further complicates the whole process.
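
As a rough illustration of de-identification, the sketch below removes a few direct identifiers from a metadata dictionary and replaces the patient ID with a salted hash. The key names are standard DICOM attribute keywords, but the list is deliberately incomplete and the function is hypothetical. A real project should follow a complete de-identification profile and also check for identifiers burned into the pixels.

```python
import hashlib

# Illustrative subset of DICOM attribute keywords that identify a patient.
# This is NOT a complete de-identification profile.
IDENTIFYING_KEYS = {
    "PatientName",
    "PatientBirthDate",
    "PatientAddress",
    "ReferringPhysicianName",
    "InstitutionName",
}


def deidentify(metadata, salt):
    """Return a copy of metadata with direct identifiers removed.

    PatientID is replaced by a salted hash so that images from the same
    patient can still be grouped without exposing the original ID.
    """
    cleaned = {k: v for k, v in metadata.items() if k not in IDENTIFYING_KEYS}
    if "PatientID" in cleaned:
        raw = (salt + str(cleaned["PatientID"])).encode("utf-8")
        cleaned["PatientID"] = hashlib.sha256(raw).hexdigest()[:16]
    return cleaned


# Made-up header values for illustration.
header = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "PatientBirthDate": "19700101",
    "Modality": "CT",
}
print(deidentify(header, salt="project-secret"))
```

The salt should be kept private to the project, since anyone who knows it can recompute the pseudonym for a known patient ID.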

HIPAA, FDA and CE Compliance

Image datasets used in a clinical environment need an accurate history of who created or edited each annotation.

Annotation authorship, dataset integrity, and a history of data reviews are required for regulatory approval.
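
One way to picture such a history is an append-only log of annotation events. The sketch below is a hypothetical data structure, not the format of any particular platform or regulator. The class names, fields, and action labels are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AnnotationEvent:
    """One immutable entry in the history of an annotation."""
    annotation_id: str
    author: str
    action: str  # e.g. "created", "edited", "reviewed" (hypothetical labels)
    timestamp: datetime


@dataclass
class AnnotationHistory:
    """Append-only list of events; entries are never modified or removed."""
    events: list = field(default_factory=list)

    def record(self, annotation_id, author, action):
        event = AnnotationEvent(
            annotation_id, author, action, datetime.now(timezone.utc)
        )
        self.events.append(event)
        return event

    def authors_of(self, annotation_id):
        """Return everyone who touched an annotation, in order of first action."""
        seen = []
        for event in self.events:
            if event.annotation_id == annotation_id and event.author not in seen:
                seen.append(event.author)
        return seen


history = AnnotationHistory()
history.record("ann_17", "radiologist_a", "created")
history.record("ann_17", "radiologist_b", "reviewed")
history.record("ann_18", "radiologist_a", "created")
print(history.authors_of("ann_17"))
```

Because each event is frozen and only ever appended, the log can answer who took part in which annotation at any later point.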

The US FDA and European CE provide guidelines on how datasets should appear when developing models for clinical diagnostics. Working on a platform that already covers those guidelines is a good start. 

The other part involves ensuring that the right data processor agreements are in place with whoever will host, process, and perform the annotations.

💡 Pro tip: Check out the V7 is now FDA Part 11 Compliant announcement to learn more.

Medical imaging contains transparencies

What we mean by 'medical imaging contains transparencies' is that occlusions must be treated differently.

Objects that are in front of one another may appear to be behind one another. It's no secret that AI handles occlusion poorly due to a lack of presence of mind, and transparent objects can be even worse.

Luckily, though, an organ, cell or bone appearing transparent is far more obvious to an AI than a pane of glass.

See the chest X-Ray below and decide for yourself—are the lungs behind or in front of the diaphragm? 

Chest X-ray

A chest x-ray displaying the lower portion of the lungs, extending in front of the diaphragm posteriorly and behind it anteriorly

The answer is … both! Traditional computer vision methods cannot perceive the occluded portion of the lungs; however, a deep neural network can easily learn to spot it.

💡 Pro tip: Learn about Annotating With Bounding Boxes: Quality Best Practices.

Differing views and volumes

A case may contain 2D or 3D imaging. 

In both cases, more than one view is often necessary to assess what's happening. For example, the X-Ray of a hand may only reveal a fracture when the hand is in a certain pose or at a certain angle.

Nonetheless, it is standard to capture a frontal view of the hand.

Frontal view of the hand

A small fracture at the 3rd and 4th middle phalanx base is mostly only visible on the right image.

It's important not to include unusable data in your machine learning dataset.

If a view such as the one above is useful for reference purposes but cannot be labeled and turned into training data, it's best to discard it.

Similarly, volumetric data such as MRI, CT, or OCT can be browsed by the sagittal, coronal, or axial planes. 

For browsing and reference purposes, these are useful as they give a better sense of anatomy. From a machine learning perspective, unless you wish for a model to process all three planar views, it's best to stick to one and reconstruct those annotations in the other two planes. 


Sticking to a single plane provides more consistent results across cases.

For example, a team of 10 radiologist annotators labeling 100 brain CT cases axially, and another team of 10 labeling another 100 cases sagittally, will achieve slightly differing results. These differences can introduce bias to your model and lead to worse performance in both planes than if the teams had consistently applied labels to one series.
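
Labeling in one plane and reconstructing the annotations in the other two can be sketched as follows. This minimal example assumes the axial annotations are stacked into a 3D binary mask indexed as `mask[z][y][x]`. Axis order and orientation conventions vary between scanners and tools, so the indexing here is an assumption, and both function names are hypothetical.

```python
def coronal_slice(mask, y):
    """Cut a coronal plane (fixed y) out of an axially labeled mask[z][y][x]."""
    return [list(plane[y]) for plane in mask]


def sagittal_slice(mask, x):
    """Cut a sagittal plane (fixed x) out of an axially labeled mask[z][y][x]."""
    return [[row[x] for row in plane] for plane in mask]


# Made-up mask: 3 axial slices, each 2 rows by 4 columns.
mask = [
    [[0, 1, 1, 0], [0, 0, 0, 0]],
    [[0, 1, 1, 0], [0, 1, 0, 0]],
    [[0, 0, 0, 0], [0, 1, 0, 0]],
]
print(coronal_slice(mask, 0))
print(sagittal_slice(mask, 1))
```

The annotators only ever draw on axial slices, and the other two views are derived from the same voxels, so the three planes cannot disagree with each other.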

MRI Slices Annotated
The annotated slices of a CT and MRI scan of a head.

HIPAA Compliance

HIPAA guidelines are not something to take lightly. 

When searching for a platform to process your data, ensure it provides a clear answer to the following:

  1. Is data encrypted in transit and at rest?
  2. Where is the data storage geographically located, and in which circumstances would this change?
  3. In what form does the image information reach the client, and are any measures taken to prevent them from exporting it?
  4. What user access restrictions are in place to prevent unauthorized access to the data, and is there a user management system?
  5. Are there any measures to prevent users from leaving their devices connected to the service, resulting in security issues?
  6. What information is logged during usage?

The six questions above are basic technical compliance requirements.
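
To make questions 5 and 6 more concrete, here is a minimal sketch of an idle-session timeout combined with an access log. It is an illustration only, not a HIPAA-certified mechanism. The class name, the 15-minute timeout, and the log layout are all assumptions.

```python
import time


class Session:
    """A user session that expires after a period of inactivity."""

    def __init__(self, user, idle_timeout_s=900, clock=time.monotonic):
        self.user = user
        self.idle_timeout_s = idle_timeout_s  # 900 s = 15 min, chosen arbitrarily
        self._clock = clock
        self._last_seen = clock()
        self.access_log = []  # entries: (time, user, resource, allowed)

    def is_expired(self):
        return self._clock() - self._last_seen > self.idle_timeout_s

    def access(self, resource):
        """Log an access attempt and refuse it if the session has idled out."""
        allowed = not self.is_expired()
        self.access_log.append((self._clock(), self.user, resource, allowed))
        if allowed:
            self._last_seen = self._clock()
        return allowed


# A fake clock makes the behavior easy to demonstrate without waiting.
fake_now = [0.0]
session = Session("annotator_1", clock=lambda: fake_now[0])
print(session.access("scan_001"))  # within the timeout
fake_now[0] = 901.0                # 15 minutes and 1 second later
print(session.access("scan_002"))  # session has idled out
```

Denied attempts are logged as well as successful ones, so the log records who tried to open which scan and when.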

💡 Pro tip: Learn more about V7 Dataset Management and Data Security.

Below are a few data-access-related ones, which you will want to pay attention to when adding users to your training data platform:

  1. Does the facility where the annotators label the data have means of access control?
  2. Do the annotators with access to electronic patient health information (ePHI) access their mobile devices while working?
  3. Do you maintain a detailed inventory of all hardware involved in this project and a record of its movement?
  4. Are human annotators trained for, and aware of, these requirements?

A good reference point to start with for understanding HIPAA requirements is a checklist like this one.

Often the strictness of data security requirements scales with the size of your project, and the fines for infringing HIPAA requirements are high. 

If HIPAA compliance is a requirement for your project, it's always recommended to have a professional legal audit of the firm you are working with to ensure everything is in place. The last thing you want is for the company handling all of your data to incur enormous penalties due to a small security oversight.

Choosing the best medical image annotation tool

Radiologists annotate (or mark up) medical images on a daily basis.

This can be done in DICOM viewers, which contain basic annotation capabilities such as bounding boxes, arrows, and sometimes polygons. 

Machine learning (ML) may sometimes leverage these labels. However, their format is often inconsistent with the needs of ML research: they may lack instance IDs, attributes, a labeling queue, or the correct formats for deep learning frameworks like PyTorch or TensorFlow.

The automated segmentation of a lung in a lateral thoracic X-Ray

For example, you can't develop a neural network analyzing pulmonary fibrosis from radiologist DICOM markup. Instead, you will have to carefully label slices using a professional tool.
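
To show what an ML-oriented label might carry that a viewer markup often lacks, the sketch below turns a polygon into a record with an instance ID, attributes, and a bounding box. The field names are hypothetical and do not represent V7's export format. The `bbox` and `segmentation` layouts follow the COCO convention.

```python
import json
import uuid


def polygon_to_record(points, label, image_id, attributes=None):
    """Build an annotation record from a list of (x, y) polygon vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    return {
        "instance_id": str(uuid.uuid4()),  # unique per labeled object
        "image_id": image_id,
        "label": label,
        "attributes": attributes or {},
        # COCO-style: a list of flattened polygons [x1, y1, x2, y2, ...].
        "segmentation": [[coord for point in points for coord in point]],
        # COCO-style box: [x, y, width, height].
        "bbox": [x_min, y_min, max(xs) - x_min, max(ys) - y_min],
    }


# Made-up polygon around a region on a single slice.
record = polygon_to_record(
    [(10, 20), (50, 20), (50, 80), (10, 80)],
    label="lung",
    image_id="slice_042",
    attributes={"side": "left"},
)
print(json.dumps(record, indent=2))
```

A record like this can be serialized to JSON and loaded by a custom dataset class in a deep learning framework.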

💡 Pro tip: Ready to train your models? Have a look at Mean Average Precision (mAP) Explained: Everything You Need to Know.

Here are a few questions you should ask when choosing medical image annotation software.

  • Does the tool support DICOM?
  • Is it HIPAA compliant?
  • Does it come with data labeling services for medical image annotation?
  • What annotation types does it offer?
  • Does it support video/time series?
  • Is it possible to export data to deep learning frameworks like PyTorch or TensorFlow?

It's always good to partner with a company that has already invested the time and effort required to comply with the various data formats, regulatory requirements, and user experience needed for a successful medical AI project.

V7 is one of them. 

If you’d like to discuss your medical data annotation project, don’t hesitate to schedule a call or send us an email today.

💡 Read more:

3 Signs You Are Ready to Annotate Data for Machine Learning

6 Innovative Artificial Intelligence Applications in Dentistry

Computer Vision: Everything You Need to Know

A Comprehensive Guide to Human Pose Estimation

The Complete Guide to Panoptic Segmentation [+V7 Tutorial]

9 Reinforcement Learning Real-Life Applications

Mean Average Precision (mAP) Explained: Everything You Need to Know

The Beginner’s Guide to Contrastive Learning

Previously CEO at Aipoly - First smartphone engine for convolutional neural networks. Management & Stats grad at Cass Business School and Singularity University. Never had a real job.
