Video segmentation is the process of partitioning videos into multiple regions based on certain characteristics, such as object boundaries, motion, color, texture, or other visual features. The goal is to separate objects from the background, identify temporal events in a video, and provide a more detailed and structured representation of the visual content.
It’s a crucial task for the fields of computer vision and multimedia—it allows for the identification and characterization of individual objects and events in the video, as well as for the organization and classification of the video content. Multiple techniques are constantly being developed to maximize accuracy and efficiency.
In this article, we’ll walk you through various approaches and techniques used for video segmentation, as well as the applications and challenges of this task. We’ll also show you how to get started on video segmentation with V7.
Video segmentation is a fundamental step in analyzing and understanding video content as it enables the extraction of meaningful information and features from the video. It involves dividing the video into individual segments or shots, typically defined by changes in the scene, camera angle, or other visual features. These segments can then be analyzed and characterized based on their content, duration, and other attributes, providing a basis for further analysis and understanding of the video.
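To make the idea of splitting a video into shots more concrete, here is a minimal sketch of shot boundary detection. It assumes grayscale frames stored as nested lists and compares intensity histograms of consecutive frames. The function names, the bin count, and the threshold are illustrative assumptions rather than part of any specific tool.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a grayscale frame (list of rows)."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for value in row:
            counts[value * bins // max_value] += 1
            total += 1
    return [count / total for count in counts]


def shot_boundaries(frames, threshold=0.5):
    """Return the indices of frames that start a new shot.

    A boundary is declared when the L1 distance between the histograms of
    two consecutive frames exceeds `threshold` (an illustrative value).
    """
    boundaries = []
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            boundaries.append(index)
        previous = current
    return boundaries


# Toy video: two dark frames followed by two bright frames (2x2 pixels each).
dark = [[10, 12], [11, 9]]
bright = [[240, 250], [245, 235]]
video = [dark, dark, bright, bright]
print(shot_boundaries(video))  # [2]
```

Real footage usually needs color histograms, smoothing, and a tuned threshold, but the structure of the comparison stays the same.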
Video segmentation can be performed at various levels of granularity, ranging from the segmentation of individual objects or events within a shot to the segmentation of entire shots or scenes. It can also be performed at different stages of the video processing pipeline, from the raw video data to the extracted features or annotations.
The various approaches and techniques developed for video segmentation can be broadly classified into two categories: video object segmentation (VOS) and video semantic segmentation (VSS).
Video object segmentation and video semantic segmentation are two important tasks in computer vision that aim to understand the contents of a video.
Video object segmentation focuses on tracking objects within a video and is used in applications such as surveillance and autonomous vehicles.
Video semantic segmentation focuses on understanding the overall scene and its contents and is used in applications such as augmented reality and video summarization. These tasks have different methods and evaluation metrics and are used in different application scenarios, which we will explore now.
Video object segmentation is the task of segmenting and tracking specific objects within a video.
This is typically done by object initialization—identifying the object in the first frame of the video—and then tracking its movement throughout the rest of the video. The goal is to segment the object from the background and follow the changes in its movement. This task is useful in applications such as video surveillance, robotics, and autonomous vehicles.
There are various methods for object initialization, including:
Once the object has been initialized, it must be tracked throughout the rest of the video. There are various methods for object tracking, including traditional object tracking algorithms, such as the Kalman filter and the particle filter, and more recent deep learning-based methods. These deep learning-based methods typically use a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to segment and track objects.
Let's consider an example of a dancer whose legs and shoes are being annotated. The object segmentation task was achieved with the V7 video annotation tool:
A rough selection of the legs and shoes, followed by some minor adjustments, allows us to apply the annotation masks across multiple frames.
Evaluation of video object segmentation methods is typically done using metrics such as the Intersection over Union (IoU) and the Multiple Object Tracking Accuracy (MOTA). IoU measures the overlap between the predicted object mask and the ground truth mask, while MOTA measures the overall accuracy of the object tracking algorithm.
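Both metrics are simple to compute once predictions have been matched to the ground truth. The sketch below assumes binary masks stored as nested lists and uses the common CLEAR MOT form of MOTA, where error counts are summed over all frames. The function names are our own.

```python
def mask_iou(predicted, ground_truth):
    """IoU between two binary masks given as equally sized lists of rows."""
    intersection = 0
    union = 0
    for predicted_row, truth_row in zip(predicted, ground_truth):
        for p, g in zip(predicted_row, truth_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return intersection / union if union else 1.0


def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """MOTA = 1 - (FN + FP + identity switches) / ground-truth objects."""
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
print(mask_iou(predicted, ground_truth))  # 0.5 (2 shared pixels, 4 in the union)
print(mota(2, 1, 1, 20))                  # 4 errors over 20 ground-truth objects
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.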
Unsupervised VOS, as the name suggests, aims to segment objects in a video without using any labeled data. This challenging task requires the model to learn the appearance and motion of objects in the video and to separate them from the background.
One popular approach to Unsupervised VOS is based on optical flow, a technique that estimates the motion of pixels between consecutive frames in a video. Optical flow can be used to track the motion of objects in the video and to segment them from the background.
An example of such a method is the Focus on Foreground Network (F2Net). It exploits center point information to focus on the foreground object. Unlike the common appearance matching-based methods, F2Net additionally establishes a “Center Prediction Branch” to estimate the center location of the primary object. The predicted center point is then encoded into a Gaussian map, which serves as a spatial guidance prior that enhances the intra-frame and inter-frame feature matching in the model’s Center Guiding Appearance Diffusion Module, leading the model to focus on the foreground object.
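Encoding a center point as a Gaussian map can be sketched in a few lines. This is only an illustration of the idea, not F2Net's implementation: the `sigma` value, the function names, and the element-wise weighting in `apply_guidance` are assumptions made for this example.

```python
import math


def gaussian_center_map(height, width, center_row, center_col, sigma=2.0):
    """Return a height x width map that peaks at 1.0 on the given center."""
    heatmap = []
    for row in range(height):
        line = []
        for col in range(width):
            squared_distance = (row - center_row) ** 2 + (col - center_col) ** 2
            line.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
        heatmap.append(line)
    return heatmap


def apply_guidance(features, heatmap):
    """Weight a single-channel feature map by the spatial prior (assumed form)."""
    return [
        [feature * weight for feature, weight in zip(feature_row, weight_row)]
        for feature_row, weight_row in zip(features, heatmap)
    ]


prior = gaussian_center_map(5, 5, center_row=2, center_col=2)
features = [[1.0] * 5 for _ in range(5)]
guided = apply_guidance(features, prior)
# Responses near the predicted center are kept, distant ones are damped.
print(guided[2][2], guided[0][0] < guided[1][1] < guided[2][2])
```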
After the appearance matching process, F2Net gets three kinds of information flows: inter-frame features, intra-frame features, and original semantic features of the current frame. Instead of fusing these three features by simple concatenation like the previous methods, F2Net uses an attention-based Dynamic Information Fusion Module to automatically select the most discriminative features leading to better segmentation performance.
Examples of the results obtained by the F2Net model are shown below.
Semi-Supervised VOS methods use a small amount of labeled data to guide the segmentation process and unsupervised methods to refine the segmentation results.
This approach leverages the strengths of both supervised and unsupervised methods to achieve higher efficiency and accuracy.
One of the key advantages of semi-supervised video object segmentation is that it requires less labeled data than supervised methods. This is particularly useful in cases where obtaining labeled data is difficult or expensive. Additionally, the unsupervised methods used in semi-supervised video object segmentation can help to improve the robustness and generalization of the segmentation results, as they can take into account additional context and information that may not be present in the labeled data.
For example, the “Sparse Spatiotemporal Transformers (SST)” model proposed in 2021 uses semi-supervised learning for the VOS task. SST processes videos in a single feedforward pass of an efficient attention-based network. At every layer of this net, each spatiotemporal feature vector simultaneously interacts with all other feature vectors in the video.
Furthermore, SST is feedforward, so it avoids the compounding error issue inherent in recurrent methods. SST addresses computational complexity using sparse attention operator variants, making it possible to apply self-attention to high-resolution videos.
Here’s an example of the qualitative performance obtained by SST juxtaposed with the then state-of-the-art method CFBI:
Interactive VOS is a technique used to segment and track objects within a video in real time. The interactive aspect of this technique refers to the user’s ability to provide input to the algorithm—for example, specify the initial location of an object in the first frame of the video or draw a bounding box around the object. This user input can then guide the algorithm in its segmentation and tracking of the object throughout the rest of the video.
One of the main benefits of Interactive VOS is its ability to improve object segmentation and tracking accuracy and reliability, especially in cases where the objects are partially occluded or have similar appearances to other objects in the video. This technique can also help train more accurate object detection models by providing additional annotated data.
An example of Interactive VOS is the framework proposed in this paper. The authors focused on frame recommendation: deciding which frame the user should annotate so that the annotation brings the most information to the model.
The authors formulated the frame recommendation problem as a Markov Decision Process (MDP) and trained the recommendation agent with Deep Reinforcement Learning. To narrow the state space, they defined the state as the segmentation quality of each frame instead of the image frames and segmentation masks.
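To show what "state" and "action" mean here, the sketch below uses a deliberately simplified stand-in for the learned agent. The state is a list of estimated per-frame segmentation quality scores, and the action is the index of the frame to recommend. The greedy rule (pick the lowest-quality frame that has not been annotated yet) is our own assumption and not the reinforcement learning policy from the paper.

```python
def recommend_frame(quality_per_frame, already_annotated):
    """Pick the not-yet-annotated frame with the lowest estimated quality.

    This greedy rule is a hypothetical placeholder for a learned policy.
    """
    candidates = [
        index for index in range(len(quality_per_frame))
        if index not in already_annotated
    ]
    return min(candidates, key=lambda index: quality_per_frame[index])


# State: estimated segmentation quality of each frame (made-up values).
state = [0.9, 0.4, 0.7, 0.3]
annotated = set()

first = recommend_frame(state, annotated)
annotated.add(first)
second = recommend_frame(state, annotated)
print(first, second)  # 3 1
```

In the full loop, the quality estimates would be refreshed after each round of user scribbles and mask refinement before the next recommendation is made.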
Given that the user scribbles on the recommended frame, the framework leverages off-the-shelf interactive VOS algorithms to refine the segmentation masks. Without any ground-truth information, the learned agent can recommend the frame for annotation. Qualitative results obtained by the authors on two different datasets are shown below.
Language-guided VOS is a technique that uses natural language input to guide the segmentation and tracking of objects within a video. This is typically done by using a combination of machine learning algorithms, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), and Natural Language Processing (NLP) techniques to understand the user's input.
The main advantage of using natural language input is that it allows for more flexible and intuitive interaction with the algorithm. For example, instead of manually specifying the initial location of an object in the first frame of the video, a user can simply provide a verbal description of the object, such as "the red car" or "the person wearing a blue shirt." This can be especially useful in cases where the objects are difficult to locate or have similar appearances to other objects in the video.
To achieve this, the algorithm first uses NLP techniques to process the user's input and extract relevant information about the object to be segmented and tracked. This information is then used to guide the segmentation and tracking process, for example, by using the object's color or shape as a cue.
One such framework is the Multimodal Tracking Transformer (MTTR) model, where the objective is to segment text-referred object instances in the frames of a given video. For this, the MTTR model extracts linguistic features from the text query using a standard Transformer-based text encoder and visual features from the video frames using a spatiotemporal encoder. The features are then passed into a multimodal Transformer, which outputs several sequences of object predictions.
Finally, to determine which of the predicted sequences best corresponds to the referred object, MTTR computes a text-reference score for each sequence using a temporal segment voting scheme. This allows the model to focus on more relevant parts of the video when making the decision. The overview of the MTTR pipeline is shown above.
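The selection step can be pictured as a vote across frames. In the sketch below, we assume each predicted sequence already has a per-frame score saying how well it matches the text query, and the scores are simply averaged over time. The function name and the plain averaging are illustrative assumptions, not MTTR's exact formulation.

```python
def select_referred_sequence(scores):
    """Pick the sequence whose per-frame scores are highest on average.

    scores[s][t] is the (assumed) text-reference score of sequence s on frame t.
    """
    averaged = [sum(frame_scores) / len(frame_scores) for frame_scores in scores]
    best = max(range(len(averaged)), key=lambda s: averaged[s])
    return best, averaged


scores = [
    [0.2, 0.9, 0.1],  # strong on one frame only
    [0.6, 0.7, 0.8],  # consistently relevant
    [0.1, 0.1, 0.2],
]
best, averaged = select_referred_sequence(scores)
print(best)  # 1
```

Sequence 0 has the single highest frame score, but sequence 1 wins because every frame contributes to the decision.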
Examples of the performance obtained by the MTTR model based on the text and video queries are shown below.
Video semantic segmentation is the task of segmenting and understanding the semantic content of a video. This includes not only segmenting objects but also understanding their meaning and context. For example, a video semantic segmentation model might be able to identify that a person is walking on a sidewalk, a car is driving on the road, and a building is a skyscraper. The goal is to understand the scene and its contents rather than just tracking specific objects. This task is helpful in applications such as scene understanding, augmented reality, and video summarization.
The process of video semantic segmentation typically begins with extracting features from the video frames using convolutional neural networks (CNNs). CNNs can learn hierarchical representations of the image data, allowing them to understand the contents of the image at multiple levels of abstraction.
Once the features are extracted, they are used to classify each pixel in the video. This is typically done using a fully convolutional network (FCN), a type of CNN designed for dense prediction tasks. FCNs can take an input image and produce a dense output, where each pixel in the output corresponds to a class label (“object” or “background,” for example).
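The last step of dense prediction, turning per-class scores into a label for every pixel, can be sketched as follows. We assume the network output is available as one score map per class. The names and toy values are illustrative.

```python
def classify_pixels(class_scores, class_names):
    """Assign each pixel the class with the highest score.

    class_scores[c][row][col] is the score of class c at that pixel.
    """
    num_classes = len(class_scores)
    height = len(class_scores[0])
    width = len(class_scores[0][0])
    label_map = []
    for row in range(height):
        labels = []
        for col in range(width):
            best = max(range(num_classes),
                       key=lambda c: class_scores[c][row][col])
            labels.append(class_names[best])
        label_map.append(labels)
    return label_map


class_scores = [
    [[0.9, 0.2], [0.6, 0.1]],  # scores for "background"
    [[0.1, 0.8], [0.4, 0.9]],  # scores for "object"
]
label_map = classify_pixels(class_scores, ["background", "object"])
print(label_map)
# [['background', 'object'], ['background', 'object']]
```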
Video semantic segmentation methods are evaluated using metrics such as the mean Intersection over Union (mIoU) and Pixel Accuracy (PA). mIoU averages, over all classes, the overlap between the predicted and ground truth regions of each class, while PA measures the fraction of pixels that are assigned the correct class.
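Here is a minimal sketch of both metrics for a single frame, assuming label maps stored as nested lists of integer class IDs. Skipping classes that appear in neither map is a convention chosen for this example, and benchmarks may handle that case differently.

```python
def pixel_accuracy(predicted, ground_truth):
    """Fraction of pixels whose predicted class matches the ground truth."""
    correct = 0
    total = 0
    for predicted_row, truth_row in zip(predicted, ground_truth):
        for p, g in zip(predicted_row, truth_row):
            correct += 1 if p == g else 0
            total += 1
    return correct / total


def mean_iou(predicted, ground_truth, num_classes):
    """Average of per-class IoU over classes present in either label map."""
    ious = []
    for class_id in range(num_classes):
        intersection = 0
        union = 0
        for predicted_row, truth_row in zip(predicted, ground_truth):
            for p, g in zip(predicted_row, truth_row):
                in_prediction = p == class_id
                in_truth = g == class_id
                intersection += 1 if (in_prediction and in_truth) else 0
                union += 1 if (in_prediction or in_truth) else 0
        if union:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious)


predicted_labels = [[0, 0, 1],
                    [1, 1, 2]]
true_labels = [[0, 1, 1],
               [1, 1, 2]]
print(pixel_accuracy(predicted_labels, true_labels))  # 5 of 6 pixels correct
print(mean_iou(predicted_labels, true_labels, 3))     # (0.5 + 0.75 + 1.0) / 3
```

For a video, the counts are typically accumulated over all frames before the ratios are taken.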
Instance-agnostic VSS identifies and segments objects in a video sequence by class, without distinguishing the individual instances of each class. This is in contrast to instance-aware semantic segmentation, which tracks and segments individual instances of objects within a video. Because it skips instance tracking, the instance-agnostic approach is less computationally demanding.
The Temporally Distributed Network (TDNet) is an example of a video semantic segmentation architecture inspired by group convolutions, which show that extracting features with separated filter groups not only allows for model parallelization but also helps learn better representations.
Given a deep image segmentation network, TDNet divides the features extracted by the deep model into N (e.g., N=2 or 4) groups and uses N distinct shallow sub-networks to approximate each group of feature channels. By forcing each sub-network to cover a separate feature subspace, a strong feature representation can be produced by reassembling the output of these sub-networks. For balanced and efficient computation over time, the N sub-networks share the same shallow architecture, which is set to be (1/N) of the original deep model’s size to preserve a similar total model capacity.
The architecture is coupled with a grouped Knowledge Distillation loss to accelerate the semantic segmentation models for videos. The overview of the TDNet workflow is shown above, and some qualitative results obtained by the model are shown below.
Video instance segmentation identifies and segments individual instances of objects within a video sequence. This approach is in contrast to the instance-agnostic semantic segmentation, which only identifies and segments objects within a video without considering individual instances.
A visual example depicting the difference between these two classes of video segmentation algorithms is shown below.
Video instance segmentation Transformer (VisTR) is a framework built for instance segmentation that views the instance segmentation task as a parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, the VisTR outputs the sequence of masks for each instance in the video directly.
First, given a sequence of video frames, a standard CNN module extracts features of individual image frames. The multiple image features are concatenated in the frame order to form the clip-level feature sequence. Next, the Transformer takes the clip-level feature sequence as input and outputs a sequence of object predictions in order.
The sequence of predictions follows the order of input images, and the predictions of each image follow the same instance order. Thus, instance tracking is achieved seamlessly and naturally in the same framework of instance segmentation. An overview of the VisTR model is shown above and some qualitative results obtained are shown below.
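The ordering described above is what makes tracking fall out for free. In the sketch below, we assume the output is a flat, frame-major list with a fixed number of instance slots per frame. Regrouping it by slot yields one track per instance. The function name and string placeholders are illustrative.

```python
def group_into_tracks(flat_predictions, num_frames, num_instances):
    """Regroup a frame-major prediction sequence into per-instance tracks."""
    if len(flat_predictions) != num_frames * num_instances:
        raise ValueError("expected num_frames * num_instances predictions")
    tracks = [[] for _ in range(num_instances)]
    for frame in range(num_frames):
        for instance in range(num_instances):
            # The same slot index refers to the same instance in every frame.
            tracks[instance].append(
                flat_predictions[frame * num_instances + instance])
    return tracks


flat = ["f0-cat", "f0-dog", "f1-cat", "f1-dog", "f2-cat", "f2-dog"]
tracks = group_into_tracks(flat, num_frames=3, num_instances=2)
print(tracks)
# [['f0-cat', 'f1-cat', 'f2-cat'], ['f0-dog', 'f1-dog', 'f2-dog']]
```

In practice each entry would be a mask and a class score rather than a string, but no separate association step is needed to link them across frames.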
Video panoptic segmentation (VPS) assigns every pixel in a video sequence a semantic class and, for countable objects, an instance identity that stays consistent across frames, all in a single step. This approach combines the strengths of both instance-agnostic semantic segmentation and video instance segmentation.
The main advantage of VPS is that it can differentiate between individual objects and background regions in a video, providing a more detailed understanding of the scene. It can also distinguish and segment multiple instances of the same object class, even when they overlap, although this comes at the cost of high computational demand. These capabilities are particularly useful for applications such as video surveillance, autonomous vehicles, and drones.
An example of such a framework is the ViP-DeepLab model that performs Depth-aware Video Panoptic Segmentation (DVPS) as a step toward solving the inverse projection problem (which refers to the ambiguous mapping from the retinal images to the sources of retinal stimulation).
The authors found that video panoptic segmentation can be modeled as concatenated image panoptic segmentation. Motivated by this, they extended the Panoptic-DeepLab model to perform center regression for two consecutive frames with respect to only the object centers appearing in the first frame. During inference, this offset prediction allows ViP-DeepLab to group all the pixels in the two frames to the same object that appears in the first frame. New instances emerge if they are not grouped with the previously detected instances. The schematic workflow of the model looks like this:
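The grouping step can be illustrated with a toy example. We assume each pixel in both frames predicts an offset pointing to an object center in the first frame, and we assign the pixel to the nearest known center. This sketch leaves out the semantic branch, the depth prediction, and the handling of newly emerging instances. All names are our own.

```python
def group_pixels(offsets, centers):
    """Assign each pixel to the center closest to where its offset points.

    offsets[row][col] is a (d_row, d_col) displacement toward an object
    center in the first frame. centers is a list of (row, col) positions.
    """
    id_map = []
    for row, offset_row in enumerate(offsets):
        ids = []
        for col, (d_row, d_col) in enumerate(offset_row):
            target_row = row + d_row
            target_col = col + d_col
            nearest = min(
                range(len(centers)),
                key=lambda i: (centers[i][0] - target_row) ** 2
                + (centers[i][1] - target_col) ** 2,
            )
            ids.append(nearest)
        id_map.append(ids)
    return id_map


# Object centers detected in the first frame of a 1x4 toy video.
centers = [(0, 0), (0, 3)]
frame_one_offsets = [[(0, 0), (0, -1), (0, 1), (0, 0)]]
# In the second frame, object 0 has grown to cover three pixels.
frame_two_offsets = [[(0, 0), (0, -1), (0, -2), (0, 0)]]

print(group_pixels(frame_one_offsets, centers))  # [[0, 0, 1, 1]]
print(group_pixels(frame_two_offsets, centers))  # [[0, 0, 0, 1]]
```

Because both frames vote for the same first-frame centers, a pixel in the second frame receives the same instance ID as the object it belongs to in the first frame.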
Some qualitative results obtained by ViP-DeepLab are shown below.
Despite the many benefits and applications of video segmentation, there are also several challenges and limitations that need to be considered. Some of the key challenges and limitations of video segmentation include the following:
Video segmentation has varied applications across many different industries.
In today’s social media-driven world, an important application of video segmentation is in video editing, where an AI model can automatically identify and extract specific scenes or actions from a video. This can save editors a lot of time and effort, allowing them to quickly and easily create new videos from existing footage.
Another application is in surveillance, where video segmentation can be used to automatically identify and track specific objects or people in a video. This can be used for security purposes, such as identifying potential threats or detecting suspicious behavior.
Video segmentation is also applicable in the field of sports analysis. By automatically identifying and tracking players and actions within a video, it can be used to analyze and improve player performance and help coaches make strategic decisions.
In entertainment, video segmentation can automatically generate captions and subtitles for videos, making them more accessible to a wider audience.
In transportation, video segmentation can be used to analyze footage from cameras on vehicles. This might help to identify and prevent accidents, as well as monitor driver behavior.
V7 video annotation and tagging features let you perform accurate annotations for semantic, instance, and panoptic segmentation or object detection.
Let’s go through a quick tutorial on annotating your videos with the help of the V7 toolset.
Once you have successfully signed up for a V7 account, go to the Datasets panel and add a new dataset. You can then drag and drop your video file. In this tutorial, we’re going to create segmentation masks of a swimming stingray. With this kind of footage, it is a good idea to keep a high frame rate. We’ll keep the native FPS.
Select the default options for the remaining dataset setup steps.
Once your video is imported, you can open it in your dataset. The view will switch to the annotation panel. You can now pick the Auto-Annotate tool (the second from the top). To create generic auto-annotations, you need to create a new polygon class. This will give us more flexibility, and we’ll be able to outline irregular shapes in the video.
Now, we can select the stingray using the Auto-Annotate tool. Just drag across the frame to delineate the area with the fish. This AI video segmentation tool will automatically create a polygon mask.
When you add the annotation, it appears on the timeline at the bottom of the panel. You can extend this to cover the whole length of your video or just a selected fragment of a scene.
Use the timeline to move to a different point in the video where the stingray's position has changed. Click “Rerun” to create a new keyframe. Readjust the area of the annotation if necessary. This will create a new segmentation mask.
The shape of your mask will morph between keyframes on its own. This means that you only need to annotate a few frames of the video, not every single one. This saves a significant amount of time and makes the task much easier.
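To give an intuition for how a mask can morph between keyframes, here is a minimal sketch based on linear interpolation of polygon vertices. It assumes both keyframes have the same number of vertices in the same order. This is only an illustration of the general idea. It is not a description of how V7 computes in-between frames, and the function name is hypothetical.

```python
def interpolate_polygon(keyframes, frame):
    """Linearly interpolate polygon vertices between surrounding keyframes.

    keyframes maps a frame index to a list of (x, y) vertices. All polygons
    are assumed to share the same vertex count and ordering.
    """
    indices = sorted(keyframes)
    if frame <= indices[0]:
        return list(keyframes[indices[0]])
    if frame >= indices[-1]:
        return list(keyframes[indices[-1]])
    for start, end in zip(indices, indices[1:]):
        if start <= frame <= end:
            t = (frame - start) / (end - start)
            return [
                (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
                for (x0, y0), (x1, y1) in zip(keyframes[start], keyframes[end])
            ]


keyframes = {
    0: [(0, 0), (10, 0), (10, 10)],
    10: [(20, 0), (30, 0), (30, 10)],
}
print(interpolate_polygon(keyframes, 5))
# [(10.0, 0.0), (20.0, 0.0), (20.0, 10.0)]
```

With only two annotated keyframes, every frame in between gets a plausible polygon, which is why a handful of keyframes can cover a whole clip.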
Once you have annotated the entire video, move it to the "Complete" stage.
This will allow you to export the annotation masks in different formats for training your machine learning model. Or, if you want to test your video segmentation model first, you can train the model online on the platform. However, in most cases, you will probably use your own framework. If you need annotated training data, you can download your annotations as a JSON file or export segmentation masks.
Here are the PNG masks generated from our annotations:
That's it! By following these steps, you can use V7 annotation tools to segment any object in a video, create keyframes, and export the results for training your model.
And if you want to see more video segmentation examples, here’s a good overview of the whole process:
Video segmentation is a fundamental task in the field of computer vision and multimedia, used for the analysis and understanding of video content. Various approaches and techniques have been developed for video segmentation, ranging from supervised methods that rely on labeled training data to unsupervised methods that rely on the inherent structure of the video.
Video segmentation has many real-world applications, including content-based video retrieval, summarization, annotation and labeling, indexing and organization, and video analytics and understanding. However, as with any newly-developed technology, video segmentation has several challenges and limitations: the variability in video content and quality, the complexity of visual scenes, the lack of training data, and the computational complexity of the task. Evaluating the performance of video segmentation approaches is important for understanding their capabilities and limitations.
“Collecting user feedback and using human-in-the-loop methods for quality control are crucial for improving AI models over time and ensuring their reliability and safety. Capturing data on the inputs, outputs, user actions, and corrections can help filter and refine the dataset for fine-tuning and developing secure ML solutions.”