The Definitive Guide to Instance Segmentation [+V7 Tutorial]

What is Instance Segmentation, how does it work and what are its real-life applications? Read on to learn how to train your own Instance Segmentation model on V7 in less than an hour!
February 22, 2022
Instance segmentation of cats and dogs

Segmentation refers to the task of segregating objects in a complex visual environment and is an important area of computer vision research. 

Instance Segmentation is a special form of image segmentation that deals with detecting instances of objects and demarcating their boundaries. It finds large-scale applicability in real-world scenarios like self-driving cars, medical imaging, aerial crop monitoring, and more.

We find Instance Segmentation to be particularly useful when distinct objects of a similar type are present and need to be monitored separately.   

Here’s what we’ll cover:

  1. What is Instance Segmentation?
  2. How does Instance Segmentation work?
  3. Instance Segmentation applications
  4. How to train an Instance Segmentation Model [V7 Tutorial]

What is Instance Segmentation?

Instance Segmentation is the technique of detecting, segmenting, and classifying every individual object in an image.

Instance segmentation performed on cats and dogs

Instance Segmentation can be seen as a combination of object detection (detecting all instances of a category in an image) and semantic segmentation, with one addition to the vanilla segmentation task: separate instances of the same class are demarcated individually.

Instance Segmentation produces a richer output than either object detection or semantic segmentation networks.

💡 Pro tip: Curious to learn more about neural networks? Check out The Essential Guide to Neural Network Architectures.

An object detection system coarsely localizes multiple objects with bounding boxes, and a semantic segmentation framework assigns a category label to every pixel. Instance Segmentation produces a segment map for each category as well as for each instance of a particular class, and therefore provides a more meaningful inference on an image.

Let’s consider an image below with cats and dogs.

Semantic segmentation and Instance segmentation on cats and a dog.

Semantic Segmentation can mark out the pixels belonging to the dog and the cats. However, it gives no indication of how many dogs and cats there are in the image.

With Instance Segmentation, one can find the bounding box of each instance (in this case, one dog and two cats) as well as the segmentation map of each instance, and thereby know the number of instances in the image.
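The difference can be sketched on a toy label grid. In the minimal Python example below, the two grids, the class names, and the `count_instances` helper are illustrative assumptions rather than the output of any real model. The semantic map alone cannot say how many cats there are, while the instance map can.

```python
from collections import defaultdict

# Toy semantic map: every pixel carries only a class label.
# 0 = background, 1 = cat, 2 = dog (labels chosen for illustration).
semantic_map = [
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]

# Toy instance map: every pixel carries an instance ID (0 = background).
instance_map = [
    [1, 1, 0, 2, 2, 0],
    [1, 1, 0, 2, 2, 0],
    [0, 0, 0, 0, 3, 3],
    [0, 0, 0, 0, 3, 3],
]

CLASS_NAMES = {1: "cat", 2: "dog"}


def count_instances(semantic, instances):
    """Count distinct instance IDs per class name."""
    ids_per_class = defaultdict(set)
    for sem_row, inst_row in zip(semantic, instances):
        for cls, inst_id in zip(sem_row, inst_row):
            if inst_id != 0:
                ids_per_class[CLASS_NAMES[cls]].add(inst_id)
    return {name: len(ids) for name, ids in ids_per_class.items()}


counts = count_instances(semantic_map, instance_map)
print(counts)  # {'cat': 2, 'dog': 1}
```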

💡 Pro tip: Looking to annotate with bounding boxes? Check out 9 Essential Features for a Bounding Box Annotation Tool.

Instance segmentation vs. Semantic Segmentation vs. Panoptic Segmentation

To understand the difference between the three kinds of segmentation (Instance, Semantic, and Panoptic), we first need to define two categories of objects: stuff and things.

Stuff refers to categories that cannot be counted, such as the sky and the road.

Things are the actual objects in the image. They can be counted by assigning a different instance ID to each one of them.

Semantic Segmentation takes the input image and assigns every pixel in the image to a category class. Thus, all instances of a particular category receive the same label.

In Instance Segmentation, bounding boxes are generated for each instance of multiple categories present along with the object segmentation masks. It treats multiple objects of the same class as distinct instances.

Panoptic segmentation distinguishes between amorphous stuff and countable things such as cars and persons. One such architecture uses the same backbone as Mask R-CNN, which is used for Instance Segmentation, and adds a panoptic head that produces the final output by processing the semantic and instance results. This panoptic head is completely parameter-free and requires no training.

The panoptic head is therefore what differentiates between things and stuff.

A pixel with no instance (thing) output is assigned a stuff class. For a pixel with both stuff and thing labels, Softmax is used to determine whether it should take the instance label or the semantic label.

Additionally, the authors of this approach added support for logits of an ‘unknown’ class to avoid making wrong predictions. If, for any pixel, the maximum ‘thing’ logit from the semantic head is larger than the maximum logit from the instance head, that pixel likely belongs to a missed instance and is labeled as unknown.
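The per-pixel decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `label_pixel`, the dictionary inputs, and in particular the form of the unknown logit (maximum thing logit minus maximum instance logit) are choices made here to express the comparison in the text, not a reproduction of any specific implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixel(stuff_logits, thing_logits, instance_logits):
    """Decide the panoptic label of one pixel.

    stuff_logits:    semantic-head logits for stuff classes, e.g. {"sky": 1.2}
    thing_logits:    semantic-head logits for thing classes (non-empty)
    instance_logits: instance-head logits per instance, e.g. {"car#1": 2.0}
    """
    if not instance_logits:
        # No instance covers this pixel, so it is treated as stuff.
        return max(stuff_logits, key=stuff_logits.get)

    # Assumed form of the 'unknown' logit: it is positive only when the
    # semantic head is more confident about a thing than the instance head.
    unknown = max(thing_logits.values()) - max(instance_logits.values())

    names = list(stuff_logits) + list(instance_logits) + ["unknown"]
    scores = list(stuff_logits.values()) + list(instance_logits.values()) + [unknown]
    probs = softmax(scores)
    return names[probs.index(max(probs))]


print(label_pixel({"sky": 3.0, "road": 0.5}, {"car": 0.1}, {}))
print(label_pixel({"sky": 0.2}, {"car": 2.0}, {"car#1": 3.5}))
print(label_pixel({"sky": 0.2}, {"car": 5.0}, {"car#1": 0.5}))
```

The three calls cover the three cases in the text: a pixel with no instance output, a pixel where the instance head agrees with the semantic head, and a pixel where the semantic head sees a thing that the instance head missed.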

Semantic segmentation vs. Instance Segmentation vs. Panoptic Segmentation done on people at the seaside.

How does Instance Segmentation work?

Instance Segmentation is a challenging task and requires the detection of multiple instances of different objects present in an image along with their per-pixel segmentation mask. 

Instance Segmentation methods can be either R-CNN driven or FCN driven. 

Instance segmentation methods

FCNs (Fully Convolutional Networks) have been widely used for Semantic Segmentation. 

However, because convolutional networks are translation-invariant, a standard FCN cannot be used directly for Instance Segmentation, which requires detecting and segmenting individual object instances: the same image pixel receives the same response (and thus the same classification score) irrespective of its relative position in the context.

For Instance Segmentation to work, segmentation needs to operate at the region level, where the same pixel can have different semantics in different regions. This cannot be modeled by a single FCN applied to the whole image.

In conventional FCNs, a classifier is trained to predict each pixel’s likelihood score of “the pixel belongs to some category”. 

💡 Pro tip: Read Image Classification Explained: An Introduction [+V7 Tutorial].

To introduce a translation-variant property, FCN-driven methods use k² position-sensitive score maps that respond to k × k evenly partitioned cells of the object. Each score now represents the likelihood of “the pixel belonging to some object instance at a relative position”.

For example, a pixel on a person can be foreground for one person instance at one relative position and background for another person instance. Object detection and segmentation can then be carried out jointly and simultaneously.
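A minimal sketch of the assembling step may help. It assumes the score maps are stored as a row-major list of grids, one per cell, and that the ROI is given in pixel coordinates with exclusive upper bounds. The function name `assemble_instance_score` and the constant-valued toy maps are hypothetical and chosen only to make the effect visible.

```python
def assemble_instance_score(score_maps, roi, k):
    """Assemble an ROI-sized score map from k*k position-sensitive maps.

    score_maps: list of k*k 2-D grids (each H x W), indexed row-major by cell.
    roi:        (x0, y0, x1, y1) in pixel coordinates, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    width, height = x1 - x0, y1 - y0
    out = []
    for y in range(y0, y1):
        row = []
        for x in range(x0, x1):
            # Which of the k x k cells does this pixel fall into,
            # relative to the ROI rather than to the image?
            cell_row = (y - y0) * k // height
            cell_col = (x - x0) * k // width
            row.append(score_maps[cell_row * k + cell_col][y][x])
        out.append(row)
    return out


# Toy setup: k = 2, so four 4x4 score maps. Map m is filled with the value m,
# which makes it easy to see which map each output pixel was copied from.
K = 2
score_maps = [[[m] * 4 for _ in range(4)] for m in range(K * K)]

roi_a = assemble_instance_score(score_maps, (1, 1, 3, 3), K)
roi_b = assemble_instance_score(score_maps, (0, 0, 2, 2), K)

# Image pixel (x=1, y=1) is the top-left cell of ROI A but the
# bottom-right cell of ROI B, so it reads from different score maps.
print(roi_a[0][0], roi_b[1][1])  # 0 3
```

The same image pixel therefore receives a different score depending on where it sits inside each region, which is the translation-variant behavior the score maps are meant to provide.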

Mask R-CNN is a widely used model for Instance Segmentation. It produces three outputs for each candidate object: a class label and a bounding-box offset, as in Faster R-CNN, and, from a third branch, an object mask, which requires extracting a much finer spatial layout of the object. 

Mask R-CNN Architecture

Unlike class labels or box offsets, which inevitably get collapsed into short output vectors by fully connected layers, the predicted masks from each ROI are able to maintain object spatial layout without collapsing into a vector representation that lacks spatial dimensions. 

Models like Faster R-CNN use ROIPool to extract a small feature map from each ROI. To do so, ROIPool quantizes continuous coordinates: for example, a coordinate x is mapped to [x / 16], where 16 is the feature map stride and [·] denotes rounding. 

💡 Pro tip: Ready to train your models? Have a look at Mean Average Precision (mAP) Explained: Everything You Need to Know.

These quantizations introduce misalignments between the ROI and the extracted features. To curb these effects, Mask R-CNN uses the ROI Align layer, which avoids quantization (e.g., x / 16 is used instead of [x / 16]). Bilinear interpolation is then used to compute the exact values of the input features at four regularly sampled locations in each ROI bin. 
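The effect of quantization can be illustrated with a small pure-Python sketch. The 3×3 feature map, the stride of 16, and the helper name `bilinear_sample` are assumptions made for the example. It also samples a single point, whereas the text describes four regularly sampled locations per bin.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a 2-D grid at a continuous (x, y) location by bilinear interpolation."""
    height, width = len(feature_map), len(feature_map[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


STRIDE = 16  # assumed feature-map stride relative to the input image

feature_map = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]

x_img, y_img = 20.0, 12.0  # a point in image coordinates

# ROIPool-style: round to the nearest feature-map cell (quantization).
qx, qy = round(x_img / STRIDE), round(y_img / STRIDE)
pooled = feature_map[qy][qx]

# ROI Align-style: keep the fractional coordinate and interpolate.
aligned = bilinear_sample(feature_map, x_img / STRIDE, y_img / STRIDE)

print(pooled, aligned)  # 4.0 3.5
```

The rounded lookup snaps to the cell at (1, 1), while the interpolated value reflects the actual location (1.25, 0.75) on the feature map.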

An ROI is considered positive if it overlaps enough with a ground-truth bounding box, and negative otherwise. The mask loss is defined only for positive ROIs. 
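As a rough sketch, the mask loss for one ROI can be written as an average per-pixel binary cross-entropy that is skipped for negative ROIs. The function name `mask_loss`, its arguments, and the toy 2×2 masks are assumptions made for illustration. A real model would compute this on tensors for the mask of the ground-truth class.

```python
import math


def sigmoid(v):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def mask_loss(mask_logits, gt_mask, is_positive):
    """Average per-pixel binary cross-entropy for one ROI.

    mask_logits: m x m grid of raw scores for the ROI's ground-truth class.
    gt_mask:     m x m grid of 0/1 target values.
    is_positive: whether the ROI overlaps a ground-truth box enough.
    """
    if not is_positive:
        # Negative ROIs contribute nothing to the mask loss.
        return 0.0
    total, count = 0.0, 0
    for logit_row, gt_row in zip(mask_logits, gt_mask):
        for logit, target in zip(logit_row, gt_row):
            p = sigmoid(logit)
            total -= target * math.log(p) + (1 - target) * math.log(1 - p)
            count += 1
    return total / count


logits = [[4.0, -4.0], [4.0, -4.0]]
matching = [[1, 0], [1, 0]]
opposite = [[0, 1], [0, 1]]

good = mask_loss(logits, matching, is_positive=True)
bad = mask_loss(logits, opposite, is_positive=True)
skipped = mask_loss(logits, opposite, is_positive=False)
print(good < bad, skipped)  # True 0.0
```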

As for architecture, ResNet has been a common choice of backbone in both R-CNN and FCN driven approaches. 

The extracted features are usually taken from the final convolutional layer of ResNet-50. For extracting ROI features, Mask R-CNN can also use an FPN (Feature Pyramid Network), which uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Anchors of a single scale are assigned to each level: formally, anchors are defined to have areas of {32², 64², ...} pixels on {P2, P3, ...} respectively. Anchors of multiple aspect ratios are also used at each level. 
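The anchor layout can be sketched as follows. Only the two levels named in the text are listed, since the remaining ones are elided there. The three aspect ratios and the helper name `anchor_shapes` are assumptions chosen for illustration.

```python
import math

# One anchor scale per pyramid level, as stated in the text.
# Further levels are elided in the text; extend this mapping as needed.
LEVEL_SIDES = {"P2": 32, "P3": 64}

# Assumed aspect ratios (height / width) used at every level.
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_shapes(side, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs that all share the area side * side."""
    shapes = []
    for ratio in ratios:
        width = side / math.sqrt(ratio)
        height = side * math.sqrt(ratio)
        shapes.append((width, height))
    return shapes


anchors = {level: anchor_shapes(side) for level, side in LEVEL_SIDES.items()}
for level, shapes in anchors.items():
    print(level, [(round(w, 1), round(h, 1)) for w, h in shapes])
```

Each level keeps a fixed area and only the shape changes, so small objects are matched on the finer levels and larger ones on the coarser levels.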

Training labels are assigned to the anchors based on their Intersection-over-Union (IoU) with the ground-truth boxes. Anchors with a high IoU receive a positive label, and anchors with a low IoU receive a negative one. After region proposals are generated, Non-Max Suppression removes bounding boxes whose IoU with a higher-scoring box exceeds a threshold. 
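Both steps rest on the same IoU computation, which the following pure-Python sketch makes explicit. The thresholds (0.7 and 0.3 for labeling, 0.5 for suppression), the "ignored" label for anchors in between, and the function names are assumptions for illustration. Production code would normally rely on a vectorized library routine instead.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative), or -1 (ignored) for one anchor."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1


def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring boxes, dropping heavy overlaps with them."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap anything and is kept.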

💡 Pro tip: Looking for a cost-free way to label your data? Check out The Complete Guide to CVAT—Pros & Cons.

Instance Segmentation applications

Here are some of the most prominent applications of Instance Segmentation.

Self-driving cars

A self-driving car facing complex street scenarios, such as a construction site or a very crowded street with a lot of pedestrians, needs a detailed understanding of its surroundings.

Such fine-grained results can be achieved by segmenting image content with pixel-level accuracy. This can be done with Panoptic Segmentation, which is a mix of Semantic Segmentation (differentiating sky, road, pedestrians, and other cars) and Instance Segmentation (differentiating different instances of the same category).

It can also be used in conjunction with dense (pixel level) distance-to-object estimation methods to help enable high-resolution 3-D depth estimation of a scene. 

Medical scans

There is also a wide variety of applications of Instance Segmentation in the medical domain. 

Histopathologic images are usually whole-slide images containing a large number of nuclei of various shapes surrounded by cytoplasm. Instance Segmentation plays an important role in detecting and segmenting these nuclei, which can then be processed further to detect dangerous diseases like cancer.

It is also used for detecting tumors in MRI scans of the brain. Semantic Segmentation is likewise widely used for segmenting multiple organs in laparoscopic surgery images and for segmenting cataract surgery instruments.

Satellite imagery

In satellite imagery, objects are usually quite small, and pixel-wise classification alone does not separate them well because they are placed closely together relative to the resolution of the image. Therefore, to treat each object as a separate instance, we can use a network architecture that performs instance segmentation to achieve a better separation between objects.

Major uses of instance segmentation on satellite images include detecting and counting cars, detecting ships for maritime security, oil discharge control and sea pollution monitoring, and segmenting buildings for geospatial analysis.


How to train an Instance Segmentation model on V7

Finally, the fun part! ;-)

Let us show you how you can train your own Instance Segmentation model on V7.

To get started, sign up or request a 14-day free trial, and then follow this short tutorial.

1. Upload data

Instance segmentation models are supervised deep learning networks, so they need labeled data to train on. To start training your instance segmentation network, you need to create a new dataset and upload your images.

V7 datasets interface

Apart from the web-based GUI, V7 also allows you to upload data using the CLI and the API. Check out V7 Dataset Management to learn how you can organize and manage your training data on V7.

💡 Pro tip: Have a look at our lists of 65+ Best Free Datasets for Machine Learning and 20+ Open Source Computer Vision Datasets to find quality data.

2. Create new annotation classes

Next, we need to create new classes that we will annotate.

For the sake of this article, we'll show you how to train a model to segment pedestrians on a busy road—a task that is a must in self-driving cars. To begin with, we will create two classes: Pedestrian and Car.

These classes should be created under the Polygon tool of V7 so that we can use a powerful auto-annotate tool to create segment maps faster.

New annotation classes creation on V7

Don't forget to also choose "Instance ID" as the subtype so that each object you annotate becomes a separate instance.

3. Auto-annotate your images or videos

Once you've created your polygon classes, it's time to auto-annotate your data!

V7's Auto-Annotate tool uses a deep learning model to automatically segment items and create pixel-perfect polygon masks.

Pedestrian automated-annotation on V7

Label all relevant objects in your images or videos.

💡 Pro tip: Have a look at V7 Annotation to get a better understanding of V7's functionalities.

4. Check for class balance

Check for class imbalance in the dataset by going to the overview panel, which shows how many instances of each class you have created and how they are represented across the entire dataset. As you can see in this dataset, only the passenger class is balanced.

Other classes are either overrepresented or underrepresented. We can fix this issue by uploading and labeling more images of the underrepresented classes and thus mitigate class imbalance.
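If you also keep a list of your annotations outside the platform, the same check can be approximated in a few lines of Python. The sketch below is not part of V7. The `labels` list stands in for a hypothetical export with one class name per annotated instance, and the `tolerance` rule for flagging a class is an arbitrary assumption.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of all annotated instances."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def underrepresented(labels, tolerance=0.5):
    """List classes whose share is below `tolerance` times an even split."""
    shares = class_shares(labels)
    even_share = 1.0 / len(shares)
    return sorted(
        name for name, share in shares.items() if share < tolerance * even_share
    )


# Hypothetical export: one class name per annotated instance.
labels = ["Pedestrian"] * 120 + ["Car"] * 20

shares = class_shares(labels)
flagged = underrepresented(labels)
print(shares)
print(flagged)  # ['Car']
```

Classes that end up in the flagged list are the ones to prioritize when uploading and labeling more images.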

Class distribution panel in V7

5. Train your model

Create a new model with instance segmentation as the task and give it a name. Select the classes you want to train on and start the training procedure.

Picking an instance segmentation model to train on V7

Easy, right?!

You can re-train your model with more annotated data until it achieves the desired performance.

Instance Segmentation: Key Takeaways

Instance segmentation is an important domain of Computer Vision, offering a combination of object detection and semantic segmentation tasks. 

It finds large-scale applicability in the healthcare and autonomous driving industries, and thereby affects our day-to-day life. As a rapidly advancing domain of research, instance segmentation is increasingly used in newer areas like satellite imagery and retail, where the application of vision was severely limited until recently. 

💡 Read more:

13 Best Image Annotation Tools

The Beginner's Guide to Self-Supervised Learning

What is Overfitting in Deep Learning and How to Avoid It

Overfitting vs. Underfitting: What's the Difference?

The Complete Guide to Ensemble Learning

The Ultimate Guide to Semi-Supervised Learning

9 Reinforcement Learning Real-Life Applications

Mean Average Precision (mAP) Explained: Everything You Need to Know

The Beginner’s Guide to Contrastive Learning

Hmrishav Bandyopadhyay studies Electronics and Telecommunication Engineering at Jadavpur University. He previously worked as a researcher at the University of California, Irvine, and Carnegie Mellon University. His deep learning research revolves around unsupervised image de-warping and segmentation.
