In computer vision, image segmentation lets machines separate the different objects in an image into individual segments.
It does this by assigning a label to every pixel, so that pixels belonging to the same class share the same label. This turns an image into a representation that is easier to analyze and understand. Image segmentation usually involves classifying, detecting, and labeling objects.
Image segmentation can be classified into three categories: semantic segmentation, instance segmentation, and panoptic segmentation.
If you’re feeling a bit lost trying to grasp all those concepts—worry not!
We’ve put together Semantic Segmentation and Instance Segmentation beginner guides that you can check out to get up to speed. In this article, we’ll deal with the topic of Panoptic Segmentation and its most prominent applications.
Now, let’s dive in.
The word panoptic is derived from two words: pan and optic.
Pan means “all” and optic means “vision”. Panoptic segmentation, therefore, roughly means “everything visible in a given visual field”.
In computer vision, the task of panoptic segmentation can be broken down into three simple steps:
This sounds exactly the same as the two other image segmentation techniques we’ve mentioned above. So, what’s the difference?
You see, panoptic segmentation is a hybrid method combining semantic segmentation and instance segmentation.
It was introduced by Alexander Kirillov and his team in 2018.
The goal of panoptic segmentation is to unify the task of image segmentation rather than handle it with two separate approaches. The authors describe it as “the unified or global view of segmentation”.
The key differentiator?
Panoptic segmentation helps classify objects into two categories: things and stuff.
In computer vision, the term things generally refers to countable objects with a well-defined shape, such as people, cars, and animals.
Stuff refers to amorphous regions that have no well-defined shape and are identified mainly by their texture or material, such as the sky, roads, and bodies of water.
You can skip this section if you’ve already mastered Semantic Segmentation and Instance Segmentation.
For those of you who need a super quick recap of the differences between those three image segmentation methods, here are the answers you’ve been looking for!
Semantic segmentation is the task of assigning a class label to every pixel in the image. In effect, it groups all the pixels of a given class under one segmentation mask and separates them from the pixels of every other class, without telling individual objects of that class apart.
Instance segmentation, on the other hand, creates a separate segmentation mask for each object, so pixels are grouped by the individual “instance” they belong to rather than by class alone.
Finally, here’s a short and sweet summary of the key differences between Semantic and Instance segmentation.
Panoptic segmentation combines both—it identifies the objects with respect to class labels and also identifies all the instances in the given image.
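To make the difference concrete, here is a minimal NumPy sketch of how the three outputs can be represented for a toy image. The class ids, the `THING_CLASSES` set, and the `class_id * 1000 + instance_id` encoding are assumptions chosen for illustration, not a format prescribed by this article.

```python
import numpy as np

# Hypothetical class ids: 0 = sky (stuff), 1 = road (stuff), 2 = car (thing).
THING_CLASSES = {2}

# Semantic segmentation: one class id per pixel; both cars share id 2.
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 2, 2],
    [1, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
])

# Instance segmentation: one instance id per pixel of a thing, 0 elsewhere.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

def to_panoptic(semantic, instance, offset=1000):
    """Encode (class, instance) per pixel as class_id * offset + instance_id."""
    is_thing = np.isin(semantic, list(THING_CLASSES))
    # Stuff pixels keep instance id 0, so each stuff class forms a single segment.
    return semantic * offset + np.where(is_thing, instance, 0)

panoptic = to_panoptic(semantic, instance)
segments = np.unique(panoptic)  # one id per segment in the image
print(segments)
```

In this toy example the panoptic map contains four segments: one for the sky, one for the road, and one for each of the two cars.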
In a basic panoptic segmentation pipeline, the input image is fed into two networks: a fully convolutional network (FCN) and Mask R-CNN.
The FCN is responsible for capturing patterns from the uncountable objects (stuff), and it yields semantic segmentations.
The FCN uses skip connections, which combine coarse, high-level features from deep layers with fine, local features from shallow layers. This lets the model reconstruct accurate segmentation boundaries and make local predictions that stay consistent with the overall structure of the object.
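The idea behind a skip connection can be sketched in a few lines of NumPy. This is a simplified illustration, not the FCN implementation: it assumes nearest-neighbour upsampling in place of learned upsampling, and the helper names `upsample2x` and `fuse_with_skip` are made up for this example.

```python
import numpy as np

def upsample2x(scores):
    """Nearest-neighbour 2x upsampling of a (classes, H, W) score map."""
    return scores.repeat(2, axis=1).repeat(2, axis=2)

def fuse_with_skip(coarse_scores, fine_scores):
    """Add upsampled deep (coarse) scores to shallow (fine) scores, then label each pixel."""
    fused = upsample2x(coarse_scores) + fine_scores
    return fused.argmax(axis=0)

# Coarse 2x2 scores for two classes: class 0 wins in the top-left cell, class 1 elsewhere.
coarse = np.array([
    [[2.0, 0.0], [0.0, 0.0]],   # class 0
    [[0.0, 2.0], [2.0, 2.0]],   # class 1
])

# Fine 4x4 scores: the shallow layer has strong local evidence for class 1 at pixel (1, 1).
fine = np.zeros((2, 4, 4))
fine[1, 1, 1] = 3.0

coarse_only = upsample2x(coarse).argmax(axis=0)  # blocky prediction without the skip
labels = fuse_with_skip(coarse, fine)            # boundary refined by the skip connection
print(labels)
```

Without the skip connection, the whole top-left 2x2 block is labeled as class 0. With it, the pixel at (1, 1) switches to class 1 while the rest of the block keeps the coarse prediction.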
Likewise, the Mask R-CNN is responsible for capturing patterns of the objects that are countable—things—and it yields instance segmentations. It consists of two stages:
The output of both models is then combined to get a more general output.
However, this approach has several drawbacks such as:
To address these issues, a newer architecture called Efficient Panoptic Segmentation (EfficientPS) was proposed, which improves both efficiency and performance.
At its most basic level, EfficientPS uses a shared backbone built on the EfficientNet architecture.
The architecture consists of:
Here’s the visual representation of EfficientPS.
The EfficientNet encoder is represented in red, while the two-way Feature Pyramid Network (FPN) is represented in purple, blue, and green. The semantic and instance segmentation heads are represented in yellow and orange, respectively, and the fusion block sits at the end.
And here’s how it works in practice:
The image is fed into the shared backbone, whose encoder is an EfficientNet. This encoder is coupled with a two-way FPN that extracts rich multi-scale features and fuses them more effectively than a conventional, one-way FPN.
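As a rough sketch of what "two-way" fusion means, the snippet below runs a top-down and a bottom-up pass over a feature pyramid and sums the results at each scale. It assumes all feature maps already share the same channel count, uses plain resizing instead of learned convolutions, and the function names are hypothetical, so treat it as an illustration of the information flow rather than the EfficientPS layers themselves.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """2x2 average pooling of a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def two_way_fpn(features):
    """Fuse a pyramid ordered fine -> coarse, each level half the size of the previous one."""
    n = len(features)

    # Top-down branch: coarse, semantically rich features flow to finer scales.
    top_down = [None] * n
    top_down[-1] = features[-1]
    for i in range(n - 2, -1, -1):
        top_down[i] = features[i] + upsample2x(top_down[i + 1])

    # Bottom-up branch: fine, spatially precise features flow to coarser scales.
    bottom_up = [None] * n
    bottom_up[0] = features[0]
    for i in range(1, n):
        bottom_up[i] = features[i] + downsample2x(bottom_up[i - 1])

    # Sum the two branches at every scale.
    return [td + bu for td, bu in zip(top_down, bottom_up)]

features = [np.ones((1, 8, 8)), np.ones((1, 4, 4)), np.ones((1, 2, 2))]
fused = two_way_fpn(features)
print([f.shape for f in fused])
```

Each output level keeps its original resolution but now carries information from both finer and coarser scales.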
The output of the shared backbone is then fed into two heads in parallel: one for semantic segmentation and the other for instance segmentation.
The semantic head consists of three different modules, which enable it to capture fine features and long-range contextual dependencies and to refine object boundaries. This, in turn, allows it to separate different objects from each other with a high level of precision.
The instance head is similar to Mask R-CNN with certain modifications. This network is responsible for classification, object detection, and mask prediction.
The last part of EfficientPS is the fusion module, which fuses the predictions from both heads.
This fusion module has no learnable parameters, so it is not updated during backpropagation. Instead, it is a fixed block that performs fusion in two stages.
To begin with, the module obtains the corresponding class prediction, confidence score, bounding box, and mask logits for each instance. Then, the module:
In the first stage, the network sorts the class predictions, bounding boxes, and mask logits by confidence score.
In the second stage, the overlap between the masks is evaluated.
This is done by calculating the sigmoid of the mask logits. Every mask logit whose sigmoid value is greater than 0.5 contributes to the corresponding binary mask. Furthermore, if the overlap between binary masks is greater than a certain threshold, the mask with the highest confidence score is retained, while the others are removed.
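The two stages can be sketched as follows. This is a minimal illustration under a few assumptions: the mask logits are already resized to the image, the overlap is measured as the fraction of a mask covered by higher-ranked masks, and the `overlap_threshold` default of 0.5 and the function name `filter_instances` are made up for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def filter_instances(scores, mask_logits, overlap_threshold=0.5):
    """Return indices of instances kept after sorting and overlap removal.

    scores:      (N,) confidence score per instance
    mask_logits: (N, H, W) mask logits, assumed to be image-sized already
    """
    order = np.argsort(-scores)           # stage 1: highest confidence first
    binary = sigmoid(mask_logits) > 0.5   # stage 2: sigmoid, then threshold at 0.5
    occupied = np.zeros(mask_logits.shape[1:], dtype=bool)
    kept = []
    for i in order:
        area = binary[i].sum()
        if area == 0:
            continue                      # empty mask, nothing to keep
        # Assumed overlap measure: share of this mask already claimed by kept masks.
        overlap = (binary[i] & occupied).sum() / area
        if overlap > overlap_threshold:
            continue                      # mostly covered by a higher-ranked mask
        kept.append(int(i))
        occupied |= binary[i]
    return kept

scores = np.array([0.9, 0.6, 0.8])
logits = np.full((3, 4, 4), -5.0)
logits[0, :, :2] = 5.0    # instance 0: left half of the image
logits[1, :3, :2] = 5.0   # instance 1: near-duplicate of instance 0
logits[2, :, 2:] = 5.0    # instance 2: right half of the image

kept = filter_instances(scores, logits)
print(kept)
```

Here the low-confidence near-duplicate (instance 1) is dropped because it lies entirely inside the higher-ranked instance 0, while the two non-overlapping instances are kept.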
The output of the semantic head is filtered in a similar way.
Once the segmentations from both heads are filtered, they are combined using the Hadamard product, and voila: we’ve just performed panoptic segmentation.
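The Hadamard product is simply an element-wise multiplication. The sketch below shows one way such a fusion can look for a single instance: the sum of the two sigmoid activations is multiplied element-wise with the sum of the two logit maps. The exact expression and the name `fuse_logits` are assumptions for illustration, since this article only names the operation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_logits(instance_logits, semantic_logits):
    """Element-wise (Hadamard) fusion: (sigmoid(a) + sigmoid(b)) * (a + b)."""
    gate = sigmoid(instance_logits) + sigmoid(semantic_logits)
    return gate * (instance_logits + semantic_logits)

# Mask logits for one instance, as seen by the instance head and the semantic head.
instance_logits = np.array([[4.0, 4.0], [-4.0, 4.0]])
semantic_logits = np.array([[4.0, -4.0], [-4.0, 0.0]])

fused = fuse_logits(instance_logits, semantic_logits)
print(np.round(fused, 2))
```

With this formulation, pixels where both heads agree get a strong fused score, while pixels where the heads contradict each other are pushed toward zero.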
If you’d like to put your knowledge to practice, here are a few Panoptic Segmentation datasets you can use:
Now, let’s discuss the most prominent applications of panoptic segmentation.
Radiologists deal with large volumes of visual data that are often difficult to interpret. For example, identifying cancer cells with the naked eye is extremely challenging due to factors such as occlusion or saturation, and that’s where panoptic segmentation comes in handy.
By incorporating panoptic segmentation in their workflows, radiologists can recognize tumor cells more easily because this method allows them to separate the background from the foreground. This is crucial as both the instances and the amorphous regions help shape the context of the disease.
Furthermore, the algorithm can classify and create segmentation masks and bounding boxes around the identified tumor cells.
Autonomous vehicles are another area where panoptic segmentation is widely used.
Separating the foreground from the background gives the vehicle a much better basis for estimating the distance to objects. This, in turn, helps it make better decisions when steering, braking, and accelerating.
These days every smartphone is equipped with a camera.
Some of these cameras are very high-end and can capture photos or videos at up to 4K resolution. (Of course, they also need software that can help them enhance the images.)
Panoptic segmentation can leverage its ability to separate things from stuff and can create effects like:
Panoptic segmentation isn’t a ground-breaking concept, but it does play a pivotal role in the field of computer vision. It is especially useful in areas that rely heavily on scene comprehension, including medicine, digital image processing, and autonomous vehicles.
We hope you have a much better idea now of how it works and how you can use it to solve various computer vision problems.
Here’s a short recap of everything we’ve covered: