Creating datasets for diagnostic use requires support for uncommon file formats, along with a set of features that maintain accountability for annotations.
Radiologists annotate (or mark up) medical images on a daily basis. This can be done in DICOM viewers, which offer basic annotation capabilities such as bounding boxes, arrows, and sometimes polygons. Machine learning (ML) can sometimes leverage these labels, but their format is often inconsistent with the needs of ML research: they lack instance IDs, attributes, a labeling queue, or the correct formats for deep learning frameworks like PyTorch or TensorFlow. For example, you can't develop a neural network that analyzes pulmonary fibrosis from radiologist DICOM markup; you will instead have to carefully label slices using a professional tool.
The automated segmentation of a lung in a lateral thoracic x-ray
If your ultimate goal is to train machine learning models, there are few differences between annotating a medical image and annotating a regular PNG or JPEG. Here are a few things to consider about medical imaging that do not apply to other vision data:
This means occlusions must be treated differently. Objects that are in front of one another may appear to be behind one another. It's no secret that AI handles occlusion poorly due to a lack of presence of mind, and transparent objects can be even worse. Luckily though, an organ, cell, or bone appearing transparent is far more obvious to an AI than a pane of glass.
See the chest x-ray below: are the lungs behind or in front of the diaphragm? The answer is both. The occluded portion of the lungs cannot be perceived by traditional computer vision methods; however, a deep neural network can easily learn to spot it.
A chest x-ray displaying the lower portion of the lungs, extending in front of the diaphragm posteriorly and behind it anteriorly
Most medical imaging will be in DICOM format. A DICOM file represents a case, which may contain one or more images.
From a machine learning perspective, the DICOM file will be converted to another lossless image format during training, so it's not strictly necessary to use DICOM files for AI research. However, preserving the integrity of the DICOM image can be useful for the annotation phase, particularly as radiologists are familiar with how DICOM viewers work after operating them for years.
Multi-layer TIF files are also used. These contain slices of an image, often from microscopy, and are notoriously sparsely supported. As with other TIF files, the acronym is jokingly expanded as "Thousands of Incompatible Formats", and whilst V7 Darwin does support most TIF files, we routinely encounter new variants that we need to add support for.
Finally, some research will make use of ultra-high resolution images that require tiling, such as Leica or Aperio's SVS. These are often used in pathology. Whilst there are many viewers that support these high resolution images, which may exceed 100,000 pixels squared and several gigabytes, very few allow you to add markups or annotations that can be read by deep learning frameworks.
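To make the idea of tiling concrete, here is a minimal sketch of how a very large image can be split into fixed-size tiles, and how a point annotated on a tile maps back to full-image coordinates. The function names, the 1024-pixel tile size, and the box layout are assumptions chosen for illustration; this is not how V7 Darwin or any SVS reader implements it.

```python
from typing import Iterator, Tuple

# Hypothetical box layout: (left, top, right, bottom) in full-image pixels.
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, tile_size: int) -> Iterator[Box]:
    """Yield non-overlapping tile boxes covering a width x height image.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    if width <= 0 or height <= 0 or tile_size <= 0:
        raise ValueError("width, height and tile_size must be positive")
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            yield (
                left,
                top,
                min(left + tile_size, width),
                min(top + tile_size, height),
            )


def tile_to_slide(box: Box, x: int, y: int) -> Tuple[int, int]:
    """Map a point given in tile-local pixels back to full-image pixels."""
    left, top, _, _ = box
    return (left + x, top + y)


# Example: a 2500 x 1200 image cut into 1024-pixel tiles.
for box in tile_boxes(2500, 1200, 1024):
    print(box)
```

Annotations drawn per tile need to be stored in full-image coordinates, as `tile_to_slide` does, so that they stay valid whichever tile size is later used for training.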
A case may contain 2D or 3D imaging. In both cases, more than one view is often necessary to assess what's happening. For example, the x-ray of a hand may only reveal a fracture when the hand is in a certain pose or at a certain angle. Nonetheless, it is standard to capture a frontal view of the hand anyway:
A small fracture at the base of the 3rd and 4th middle phalanx is mostly visible only in the right image.
It's important not to include unusable data in your machine learning dataset. If a view such as the one above is useful for reference purposes, but cannot be labelled and turned into training data, it's best to discard it.
In a similar way, volumetric data such as MRI, CT, or OCT can be browsed by the sagittal, coronal, or axial planes. For browsing and reference purposes, these are useful as they give a better sense of anatomy. From a machine learning perspective, unless you wish for a model to process all three planar views, it's best to stick to one, and reconstruct those annotations in the other two planes. This provides more consistent results across cases - for example, a team of 10 annotator radiologists labeling 100 brain CT cases axially, and another team of 10 labeling another 100 cases sagittally, will achieve slightly differing results. These can introduce bias to your model, and lead to both plane modalities performing worse than if the team had consistently applied labels to one series.
The annotated slices of a CT and MRI scan of a head
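Reconstructing annotations in the other two planes is simple once labels are stored as a volume. The sketch below assumes a segmentation mask held as nested lists indexed as `mask[z][y][x]`, with axial slices along `z`; that axis convention and the function names are assumptions for illustration, and real scans also need their voxel spacing and orientation taken into account.

```python
from typing import List

# Hypothetical layout: volume[z][y][x], where each z index is one axial slice.
Volume = List[List[List[int]]]
Slice = List[List[int]]


def axial_slice(volume: Volume, z: int) -> Slice:
    """Return the axial slice at depth z (rows are y, columns are x)."""
    return volume[z]


def coronal_slice(volume: Volume, y: int) -> Slice:
    """Return the coronal slice at row y (rows are z, columns are x)."""
    return [plane[y] for plane in volume]


def sagittal_slice(volume: Volume, x: int) -> Slice:
    """Return the sagittal slice at column x (rows are z, columns are y)."""
    return [[row[x] for row in plane] for plane in volume]


# Label two voxels on axial slices 1 and 2 of an empty 3 x 4 x 5 mask...
mask: Volume = [[[0] * 5 for _ in range(4)] for _ in range(3)]
mask[1][2][3] = 1
mask[2][2][3] = 1

# ...and read the same labels back in the other two planes.
print(coronal_slice(mask, 2))
print(sagittal_slice(mask, 3))
```

Because the coronal and sagittal views are derived from the same axial labels, every case in the dataset is annotated under one convention.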
Image datasets used in a clinical environment need to have an accurate history of who took part in developing which annotation. This form of accountability is largely built into V7 Darwin, and is available for all annotation types on all image formats.
The US FDA and European CE provide guidelines on how datasets should appear when developing models for clinical diagnostics. Starting your work on a platform that already covers those guidelines gives you a head start. The other part will involve ensuring that the right data processor agreements are in place with whoever will host, process, and perform the annotations.
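As an illustration of what such a history can look like, here is a minimal sketch of an append-only log recording who did what to each annotation. The class names, fields, and action labels are hypothetical and do not reflect V7 Darwin's data model or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AnnotationEvent:
    """One immutable entry in an annotation's history (hypothetical schema)."""
    annotation_id: str
    user: str
    action: str  # e.g. "created", "edited", "reviewed"
    timestamp: datetime


@dataclass
class AnnotationHistory:
    """Append-only log of annotation events."""
    events: List[AnnotationEvent] = field(default_factory=list)

    def record(self, annotation_id: str, user: str, action: str) -> AnnotationEvent:
        """Append a new event stamped with the current UTC time."""
        event = AnnotationEvent(
            annotation_id, user, action, datetime.now(timezone.utc)
        )
        self.events.append(event)
        return event

    def contributors(self, annotation_id: str) -> List[str]:
        """Return users who touched an annotation, in order of first contribution."""
        seen: List[str] = []
        for event in self.events:
            if event.annotation_id == annotation_id and event.user not in seen:
                seen.append(event.user)
        return seen


history = AnnotationHistory()
history.record("lesion-1", "dr_a", "created")
history.record("lesion-1", "dr_b", "reviewed")
print(history.contributors("lesion-1"))
```

Events are frozen and only ever appended, so the record of who contributed to an annotation cannot be silently rewritten later.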
HIPAA guidelines are not something to take lightly. When searching for a platform to process your data, ensure it provides a clear answer to the following:
These 6 above are basic technical compliance requirements.
Below are a few data-access related ones, which you will want to pay attention to when adding users to your training data platform:
A good reference point to start with for understanding HIPAA requirements is a checklist like this one.
The fines for infringing HIPAA requirements are high. If HIPAA compliance is a requirement for your project, it's always recommended to have a legal professional audit the firm you are working with to ensure everything is in place. The last thing you will want is for the company handling all of your data to suddenly incur enormous penalties due to a small security oversight.
To learn more about how V7 protects user data, check out our short data security statement. If you'd like to discuss a project that requires HIPAA compliance, do get in touch.
Often the strictness of data security requirements scales with the size of your project. However, it's always good to start with a partner that has already invested the time and effort required to comply with the various data formats, regulatory requirements, and user experience needed for a successful medical AI project.