Labeling edge cases for machine learning can be challenging - and even experts may disagree, resulting in inaccuracies in annotations that negatively affect your model’s performance.
To help our users create error-free training data more efficiently, we are thrilled to announce the introduction of the Consensus Stage. This new feature:
The Consensus Stage measures the degree of agreement (overlap of the annotation area) among multiple annotators and/or models and highlights any areas of disagreement. This enables the selection of the best possible annotation and reduces the amount of time required for manual review.
The Consensus Stage is a crucial part of the ML data annotation workflow that helps ensure the accuracy, consistency, and quality of the annotated dataset. The benefits of implementing a Consensus Stage include:
Consensus helps to reduce bias in the annotation process, as it provides an opportunity to identify and correct any inconsistencies introduced by individual annotators. This ensures that the final annotations are accurate across all data instances. You can compare the annotator-to- annotator, model-to-model, and annotator-to-model performance.
While consensus can add an extra step to the annotation process, it can actually help to speed up the overall process by reducing the number of individual annotations that need to be reviewed. You can take advantage of the model vs annotator setup to reduce the annotation time and cost (using a model instead of having two human annotators labeling data in parallel) and getting to agreement faster with fewer items going through the review stage.
Consensus can improve model performance by providing a more reliable and accurate training dataset, which can lead to better predictions and results. As V7 is FDA and HIPAA compliant, this will allow V7 users to create FDA-approved models faster and with more confidence.
The Consensus stage offers a comprehensive data overview. You can compare annotator performance and spot inconsistencies, add comments, and resolve disagreements. This process helps identify areas that require further training or guidance.
By incorporating a Consensus stage into the workflow, V7 allows for custom rules to be created to handle conflicting annotations, such as requiring a minimum number of annotators to agree on an annotation, or assigning more weight to certain annotators based on their expertise or performance. This level of customization can help to tailor the workflow to the specific needs and requirements of a particular project or dataset.
Setting up a Consensus stage on V7 is easy and extremely intuitive. Here’s a step-by-step tutorial to help you navigate this process seamlessly.
You can set up the Overlap threshold on 3 different levels:
a) General (Default) - globally for all annotations.
Example: When set to 80%, it means that if any given two annotations have 80% overlap, there’s an agreement, and one of them (Champion) automatically moves to the next stage in your workflow.
b) Annotation Types - based on the annotation types present in your data. It overwrites the “General” threshold, meaning that you can set up individual overlap values for every annotation type within your data.
Example: If a Polygon threshold is set to 80% and Tag threshold is set to 60%, the image where both annotators created a segmentation mask with 80% overlap, but added different tags (less than 60% overlap), the image will end up with a Disagreement that will need to be resolved in the review stage.
Note: Tags use a voting system, not IoU. It means that if you set the threshold to 60%, you would need 3 out of 5 (60%), or 2/3 (67%) (or some other >60% ratio), of annotators to add that tag for it to automatically go into Agreement path.
c) Classes - based on individual classes within your data. It overwrites the “General” and “Types” thresholds.
Example: In the case where there are 3 objects to be annotated in one image: Left lung, Right lung, Pneumonia, the failure to achieve the specified overlap threshold for any one of the individual classes will result in the disagreement, and the image being routed to the review stage.
V7 supports automated disagreement checking for:
V7 allows you to use the consensus stage for blind parallel reads (without automated disagreement checking) for the following classes:
After all annotators and/or models have added their annotations, the disagreements can be reviewed in a review stage. Head over to the Datasets tab to access files that were sent to the review stage.
Once you are in the workview, you’ll notice the annotations created by both parties and the IoU (Intersection Over Union) value, indicating the degree of the overlap between the annotations.
You can add polylines, elipses, keypoints & keypoints skeletons to a parallel annotation stage in consensus, but V7 currently does not automatically surface the disagreements in the image.
The reviewer needs to manually scan the images to identify the inconsistencies and move the file to the next stage.
V7's consensus stage is compatible with images, videos, DICOMs and PDFs, for any number of annotators and/or models.
The Consensus stage can be used in a plethora of use cases - especially in medical data workflows. Below are a couple of examples of real-life projects where the consensus stage could be applied to support the creation of high-quality training data more efficiently.
Use case: Resolving disagreements between experts & improving classes understanding
Data accuracy is particularly critical for healthcare AI teams, and the FDA requires consensus mechanisms in medical product development workflows. In this case, we’ll look at the annotations created on the dental dataset where two experts disagreed.
Here’s the workflow used for this example.
The overlap threshold was specified to be 80% across all classes, and Annotator 1 was set as our Champion - meaning that in case of agreement, the annotations made by Diana Annotator will be passed onto the Complete stage.
After Paul Annotator and Diana Annotator completed all the annotations, a few images ended up in a review stage. Below is an example of an image with identified disagreements between two experts.
As you can see, the image contains two types of disagreements.
Each disagreement needs to be reviewed and resolved by the reviewer either by accepting or discarding the chosen annotation so that it can move to the next stage (in case of Accepted annotations → Complete, and Rejected annotation → Annotate).
You can also add comments about the labelings issues and misunderstandings of classes to guide your annotators towards improving the quality of their annotations in the future.
Use case: Active learning & spotting human-made errors
The Consensus stage enables you to compare the performance of annotators and computer vision models and conduct quality checks on the created annotations. By doing so, you can monitor the consistency and accuracy of labeled data and detect potential errors or biases.
This set up is also useful for understanding where your model is struggling and what kind of data you should collect to re-train it and improve its performance.
Here’s the example of the workflow with model and human in the Consensus stage.
Use case: Comparing independently trained models
You can also use the Consensus stage to compare the performance of two models - for example trained using different architectures. This will allow you to surface any disagreements and highlight data where models are struggling.
In the example below you can see two distinct Document Processing Models. Running a consensus stage makes it easier to understand visually how they differ and what are the strengths and weaknesses of each one.