Video Segmentation: Intro, Methods, Tutorial

Discover various approaches and techniques used for video segmentation, and learn how to perform video segmentation with an AI tool.
March 3, 2023

Video segmentation is the process of partitioning a video into multiple regions based on characteristics such as object boundaries, motion, color, texture, or other visual features. Its goal is to separate objects from the background, to identify temporal events in the video, and to provide a more detailed and structured representation of the visual content.

It’s a crucial task for the fields of computer vision and multimedia—it allows for the identification and characterization of individual objects and events in the video, as well as for the organization and classification of the video content. Multiple techniques are constantly being developed to maximize accuracy and efficiency.

In this article, we’ll walk you through various approaches and techniques used for video segmentation, as well as the applications and challenges of this task. We’ll also show you how to get started on video segmentation with V7.

What is video segmentation?

Video segmentation is a fundamental step in analyzing and understanding video content as it enables the extraction of meaningful information and features from the video. It involves dividing the video into individual segments or shots, typically defined by changes in the scene, camera angle, or other visual features. These segments can then be analyzed and characterized based on their content, duration, and other attributes, providing a basis for further analysis and understanding of the video.

Example of video segmentation (frame-by-frame) (source)

Video segmentation can be performed at various levels of granularity, ranging from the segmentation of individual objects or events within a shot to the segmentation of entire shots or scenes. It can also be performed at different stages of the video processing pipeline, from the raw video data to the extracted features or annotations.

The various approaches and techniques developed for video segmentation can be broadly classified into two categories:

  1. Video Object Segmentation
  2. Video Semantic Segmentation
video segmentation division

Video object segmentation and video semantic segmentation are two important tasks in computer vision that aim to understand the contents of a video.

Video object segmentation focuses on tracking objects within a video and is used in applications such as surveillance and autonomous vehicles.

Video semantic segmentation focuses on understanding the overall scene and its contents and is used in applications such as augmented reality and video summarization. These tasks have different methods and evaluation metrics and are used in different application scenarios, which we will explore now.

💡 Pro tip: Need to brush up on your knowledge? Check out our 101 guide to image segmentation

Video Object Segmentation (VOS) methods and models

Video object segmentation is the task of segmenting and tracking specific objects within a video.

This is typically done by object initialization—identifying the object in the first frame of the video—and then tracking its movement throughout the rest of the video. The goal is to segment the object from the background and follow the changes in its movement. This task is useful in applications such as video surveillance, robotics, and autonomous vehicles.

There are various methods for object initialization, including:

  • manual annotation—the most accurate but also the most time-consuming
  • automatic annotation—the least accurate but the quickest
  • semi-automatic annotation—balancing accuracy and speed

Once the object has been initialized, it must be tracked throughout the rest of the video. There are various methods for object tracking, including traditional object tracking algorithms, such as the Kalman filter and the particle filter, and more recent deep learning-based methods. These deep learning-based methods typically use a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to segment and track objects.
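
To make the traditional tracking option more concrete, here is a minimal sketch of a constant-velocity Kalman filter that smooths the center of an object (for example, the centroid of its mask) from frame to frame. The function name `kalman_track`, the state layout, and the noise values are assumptions chosen for illustration, not part of any specific tracker.

```python
import numpy as np

def kalman_track(measurements, dt=1.0, process_var=1e-2, meas_var=1.0):
    """Track a 2D object center with a constant-velocity Kalman filter."""
    # State is [x, y, vx, vy]; only the position (x, y) is observed.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # motion model
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # observation model
    Q = process_var * np.eye(4)                 # process noise (assumed value)
    R = meas_var * np.eye(2)                    # measurement noise (assumed value)

    x = np.array([measurements[0][0], measurements[0][1], 0.0, 0.0])
    P = np.eye(4)
    estimates = []
    for z in measurements:
        # Predict the next state from the motion model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct the prediction with the measured center.
        residual = np.asarray(z, dtype=float) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ residual
        P = (np.eye(4) - K @ H) @ P
        estimates.append(x[:2].copy())
    return np.array(estimates)
```

In practice, the measurements would come from a per-frame detector or segmentation model, and the filter's prediction can be used to bridge frames where the measurement is missing or unreliable.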

Let's consider an example of a dancer whose legs and shoes are being annotated. The object segmentation task was achieved with the V7 video annotation tool:

A rough selection of the legs and shoes, followed by some minor adjustments, allows us to apply the annotation masks across multiple frames.

Evaluation of video object segmentation methods is typically done using metrics such as the Intersection over Union (IoU) and the Multiple Object Tracking Accuracy (MOTA). IoU measures the overlap between the predicted object mask and the ground truth mask, while MOTA summarizes tracking quality by penalizing missed objects, false positives, and identity switches.
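
Both metrics are simple to compute once predictions and ground truth are available. The sketch below assumes binary NumPy masks for IoU and already-counted tracking errors for MOTA. The function names are illustrative, and treating two empty masks as a perfect match is a convention chosen here.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks are empty: treated as perfect agreement
    return np.logical_and(pred, gt).sum() / union

def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + ID switches) / number of ground truth objects,
    with all counts summed over every frame of the sequence."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt
```

For a video, IoU is usually computed per frame and then averaged over the sequence.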

Unsupervised VOS

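Unsupervised VOS aims to segment the primary objects in a video without any manual annotation of the target, such as a first-frame mask. In the VOS literature, “unsupervised” usually describes this absence of user input at inference time; many such models are still trained on labeled data. This challenging task requires the model to learn the appearance and motion of objects in the video and to separate them from the background.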

One popular approach to Unsupervised VOS is based on optical flow, a technique that estimates the motion of pixels between consecutive frames in a video. Optical flow can be used to track the motion of objects in the video and to segment them from the background.
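
As a rough illustration of the idea, the sketch below turns a dense flow field into a foreground mask. It assumes the flow has already been estimated by some optical flow method and is given as an `(H, W, 2)` array of per-pixel displacements. Using the median flow vector as a stand-in for camera motion and the threshold value are simplifying assumptions.

```python
import numpy as np

def motion_foreground(flow, threshold=1.0):
    """Mark pixels whose motion differs from the dominant (background) motion.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements between frames.
    """
    flow = np.asarray(flow, dtype=float)
    # Approximate the camera/background motion by the median flow vector.
    background = np.median(flow.reshape(-1, 2), axis=0)
    # Pixels that move differently from the background are foreground candidates.
    residual = np.linalg.norm(flow - background, axis=-1)
    return residual > threshold
```

Real methods are considerably more involved, since a single global motion estimate breaks down with parallax, non-rigid motion, or objects that temporarily stop moving.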

One example of an unsupervised VOS model is the Focus on Foreground Network (F2Net), which exploits center point information to focus on the foreground object. Unlike the common appearance matching-based methods, F2Net additionally establishes a “Center Prediction Branch” to estimate the center location of the primary object. The predicted center point is then encoded into a Gaussian map that serves as a spatial guidance prior, enhancing the intra-frame and inter-frame feature matching in the Center Guiding Appearance Diffusion Module and leading the model to focus on the foreground object.
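
Encoding a center point as a Gaussian map can be sketched in a few lines. This is an illustration of the general idea rather than F2Net's actual implementation; the function name and the `sigma` value are assumptions.

```python
import numpy as np

def gaussian_center_map(height, width, center, sigma=2.0):
    """Return an (H, W) map that peaks at 1.0 on `center` = (row, col)."""
    cy, cx = center
    ys, xs = np.mgrid[0:height, 0:width]
    squared_dist = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-squared_dist / (2.0 * sigma ** 2))
```

A map like this can then weight feature responses so that locations near the predicted center contribute more to matching than distant background regions.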

Overview of the F2Net architecture (source)

After the appearance matching process, F2Net gets three kinds of information flows: inter-frame features, intra-frame features, and original semantic features of the current frame. Instead of fusing these three features by simple concatenation like the previous methods, F2Net uses an attention-based Dynamic Information Fusion Module to automatically select the most discriminative features leading to better segmentation performance.

Examples of the results obtained by the F2Net model are shown below.

Examples of the results obtained by the F2Net model

Semi-Supervised VOS

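Semi-Supervised VOS methods use a small amount of labeled data to guide the segmentation process and unsupervised methods to refine the segmentation results. In VOS benchmarks, this usually means that an annotated mask is provided for the first frame and the model propagates it through the rest of the video.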

This approach leverages the strengths of both supervised and unsupervised methods to achieve higher efficiency and accuracy.

One of the key advantages of semi-supervised video object segmentation is that it requires less labeled data than supervised methods. This is particularly useful in cases where obtaining labeled data is difficult or expensive. Additionally, the unsupervised methods used in semi-supervised video object segmentation can help to improve the robustness and generalization of the segmentation results, as they can take into account additional context and information that may not be present in the labeled data.

For example, the Sparse Spatiotemporal Transformers (SST) model proposed in 2021 addresses the semi-supervised VOS task. SST processes videos in a single feedforward pass of an efficient attention-based network. At every layer of this network, each spatiotemporal feature vector simultaneously interacts with all other feature vectors in the video.

Overview of the Sparse Spatiotemporal Transformers model (source)

Furthermore, SST is feedforward, so it avoids the compounding error issue inherent in recurrent methods. SST addresses computational complexity using sparse attention operator variants, making it possible to apply self-attention to high-resolution videos.
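
To give an intuition for what a sparse attention pattern looks like, the sketch below restricts self-attention so that each spatiotemporal token only attends to tokens from nearby frames. It is a simplified stand-in, not one of SST's actual operators. The temporal window is an assumed sparsity pattern, and there are no learned query, key, or value projections. For clarity the code masks a dense score matrix, so it shows the pattern but not the computational savings.

```python
import numpy as np

def windowed_self_attention(tokens, frame_ids, window=1):
    """Self-attention where tokens only attend within a temporal window.

    tokens: (N, D) feature vectors; frame_ids: (N,) frame index of each token.
    """
    tokens = np.asarray(tokens, dtype=float)
    frame_ids = np.asarray(frame_ids)
    dim = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(dim)                  # (N, N)
    # Sparsity pattern: allow attention only between tokens that are
    # at most `window` frames apart.
    allowed = np.abs(frame_ids[:, None] - frame_ids[None, :]) <= window
    scores = np.where(allowed, scores, -np.inf)
    # Row-wise softmax; masked entries receive exactly zero weight.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights
```

An efficient implementation would compute only the allowed entries, which is what makes attention affordable on high-resolution video.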

Here’s an example of the qualitative performance obtained by SST juxtaposed with the then state-of-the-art method CFBI:

Qualitative results obtained by the SST model compared to the state-of-the-art (source)
💡 Read more: Check out our guide on supervised vs. unsupervised learning

Interactive VOS

Interactive VOS is a technique used to segment and track objects within a video in real time. The interactive aspect of this technique refers to the user’s ability to provide input to the algorithm—for example, specify the initial location of an object in the first frame of the video or draw a bounding box around the object. This user input can then guide the algorithm in its segmentation and tracking of the object throughout the rest of the video.

One of the main benefits of Interactive VOS is its ability to improve object segmentation and tracking accuracy and reliability, especially in cases where the objects are partially occluded or have similar appearances to other objects in the video. This technique can also help train more accurate object detection models by providing additional annotated data.

An example of Interactive VOS is the framework proposed in this paper. The authors focused on determining which frame should be recommended to the user for annotation so that the annotation brings the most information to the model.

The authors formulated the frame recommendation problem as a Markov Decision Process (MDP) and trained the recommendation agent with Deep Reinforcement Learning. To narrow the state space, they defined the state as the segmentation quality of each frame instead of the image frames and segmentation masks.
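
The state representation described above, one quality score per frame, can be illustrated with a simple greedy baseline that recommends the frame with the lowest estimated quality. This is not the learned reinforcement learning agent from the paper, only a hypothetical baseline that operates on the same kind of state. The function name and the assumption that quality estimates are already available are illustrative.

```python
import numpy as np

def recommend_frame(quality, annotated=()):
    """Pick the not-yet-annotated frame with the lowest estimated quality.

    quality: per-frame segmentation quality estimates (higher is better).
    annotated: indices of frames the user has already annotated.
    """
    quality = np.asarray(quality, dtype=float).copy()
    skip = np.array(list(annotated), dtype=int)
    quality[skip] = np.inf          # never recommend the same frame twice
    return int(np.argmin(quality))
```

A learned agent can improve on this by accounting for how much a correction on one frame is expected to help its neighbors, which a purely greedy rule ignores.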

interactive vos framework

Given that the user scribbles on the recommended frame, the framework leverages off-the-shelf interactive VOS algorithms to refine the segmentation masks. Without any ground-truth information, the learned agent can recommend the frame for annotation. Qualitative results obtained by the authors on two different datasets are shown below.

qualitative comparison on davis and youtube-vos dataset

Language-guided VOS

Language-guided VOS is a technique that uses natural language input to guide the segmentation and tracking of objects within a video. This is typically done by using a combination of machine learning algorithms, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), and Natural Language Processing (NLP) techniques to understand the user's input.

The main advantage of using natural language input is that it allows for more flexible and intuitive interaction with the algorithm. For example, instead of manually specifying the initial location of an object in the first frame of the video, a user can simply provide a verbal description of the object, such as "the red car" or "the person wearing a blue shirt." This can be especially useful in cases where the objects are difficult to locate or have similar appearances to other objects in the video.

Examples of Language-Guided VOS (source)

To achieve this, the algorithm first uses NLP techniques to process the user's input and extract relevant information about the object to be segmented and tracked. This information is then used to guide the segmentation and tracking process, for example, by using the object's color or shape as a cue.

One such framework is the Multimodal Tracking Transformer (MTTR) model, where the objective is to segment text-referred object instances in the frames of a given video. For this, the MTTR model extracts linguistic features from the text query using a standard Transformer-based text encoder and visual features from the video frames using a spatiotemporal encoder. The features are then passed into a multimodal Transformer, which outputs several sequences of object predictions.

detailed overview of multimodal tracking transformer

Finally, to determine which of the predicted sequences best corresponds to the referred object, MTTR computes a text-reference score for each sequence, using a temporal segment voting scheme developed for this purpose. This allows the model to focus on more relevant parts of the video when making the decision. The overview of the MTTR pipeline is shown above.
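
The selection step can be sketched as follows, assuming the model has already produced a text-reference score for every predicted sequence in every frame. Averaging the per-frame scores is a simplified stand-in for the voting scheme; the exact scoring used by MTTR is not reproduced here, and the function name is illustrative.

```python
import numpy as np

def pick_referred_sequence(frame_scores):
    """Select the predicted sequence that best matches the text query.

    frame_scores: (T, N) array with one text-reference score per frame
    (rows) and per predicted sequence (columns).
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    # Let every frame "vote" by averaging its scores over time.
    sequence_scores = frame_scores.mean(axis=0)
    return int(np.argmax(sequence_scores)), sequence_scores
```

Aggregating over frames makes the decision robust to individual frames in which the referred object is occluded or ambiguous.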

Examples of the performance obtained by the MTTR model based on the text and video queries are shown below.

MTTR's performance on the Refer-YouTube-VOS

Video Semantic Segmentation (VSS) methods and models

Video semantic segmentation is the task of segmenting and understanding the semantic content of a video. This includes not only segmenting objects but also understanding their meaning and context. For example, a video semantic segmentation model might be able to identify that a person is walking on a sidewalk, a car is driving on the road, and a building is a skyscraper. The goal is to understand the scene and its contents rather than just tracking specific objects. This task is helpful in applications such as scene understanding, augmented reality, and video summarization.

The process of video semantic segmentation typically begins with extracting features from the video frames using convolutional neural networks (CNNs). CNNs can learn hierarchical representations of the image data, allowing them to understand the contents of the image at multiple levels of abstraction.

Once the features are extracted, they are used to classify each pixel in the video. This is typically done using a fully convolutional network (FCN), a type of CNN designed for dense prediction tasks. FCNs can take an input image and produce a dense output, where each pixel in the output corresponds to a class label (“object” or “background,” for example).

Video semantic segmentation methods are evaluated using metrics such as the mean Intersection over Union (mIoU) and Pixel Accuracy (PA). mIoU averages, over all classes, the overlap between the predicted and ground truth regions of each class, while PA measures the fraction of pixels that are assigned the correct class.
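
Both metrics can be derived from a confusion matrix accumulated over all pixels (and, for video, over all frames). The sketch below assumes integer class maps with labels in `0..num_classes-1`. Skipping classes that appear in neither prediction nor ground truth is a common convention, chosen here as an assumption.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Rows are ground truth classes, columns are predicted classes."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    index = gt * num_classes + pred
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Fraction of pixels that received the correct class."""
    return np.trace(cm) / cm.sum()

def mean_iou(cm):
    """Per-class IoU averaged over classes that occur at least once."""
    true_pos = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - true_pos
    valid = union > 0
    return (true_pos[valid] / union[valid]).mean()
```

Because PA is dominated by large classes such as road or sky, mIoU is generally the more informative of the two when classes are imbalanced.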

(Instance-Agnostic) Video Semantic Segmentation

Instance-agnostic VSS is a method to identify and segment objects in a video sequence without considering the individual instances of the objects. This makes it less computationally demanding than instance-aware segmentation, which tracks and segments individual instances of objects within a video.

The Temporally Distributed Network (TDNet) is an example of a video semantic segmentation architecture inspired by Group Convolutions, which show that extracting features with separated filter groups not only allows for model parallelization but also helps learn better representations.

Given a deep image segmentation network, TDNet divides the features extracted by the deep model into N (e.g., N=2 or 4) groups and uses N distinct shallow sub-networks to approximate each group of feature channels. By forcing each sub-network to cover a separate feature subspace, a strong feature representation can be produced by reassembling the output of these sub-networks. For balanced and efficient computation over time, the N sub-networks share the same shallow architecture, which is set to be (1/N) of the original deep model’s size to preserve a similar total model capacity.
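
The following toy sketch illustrates the idea of reassembling a full representation from N shallow sub-networks. Each "sub-network" is reduced to a random linear map, one sub-network runs per frame in round-robin order, and the outputs of the most recent N frames are simply concatenated. All of these are simplifying assumptions made for illustration; they are not TDNet's actual layers or aggregation mechanism.

```python
import numpy as np
from collections import deque

def make_subnetworks(n_groups, in_dim, group_dim, seed=0):
    """Create toy sub-networks: one linear map per feature group."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((in_dim, group_dim)) for _ in range(n_groups)]

def distributed_features(frames, subnets):
    """frames: list of (num_pixels, in_dim) arrays, one per video frame."""
    n = len(subnets)
    recent = deque(maxlen=n)   # feature groups from the last n frames
    outputs = []
    for t, frame in enumerate(frames):
        # Only one shallow sub-network is evaluated per frame.
        recent.append(frame @ subnets[t % n])
        if len(recent) == n:
            # Reassemble a full-width representation from the n latest groups.
            outputs.append(np.concatenate(list(recent), axis=-1))
    return outputs
```

The per-frame cost is that of a single shallow sub-network, while the reassembled feature has the width of the full model. A real system also has to compensate for motion between the frames whose features are combined, which plain concatenation ignores.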

TDNet with four sub-networks

The architecture is coupled with a grouped Knowledge Distillation loss to accelerate the semantic segmentation models for videos. The overview of the TDNet workflow is shown above, and some qualitative results obtained by the model are shown below.

qualitative results obtained by TDNet

Video Instance Segmentation

Video instance segmentation identifies and segments individual instances of objects within a video sequence. This approach is in contrast to the instance-agnostic semantic segmentation, which only identifies and segments objects within a video without considering individual instances.

A visual example depicting the difference between these two classes of video segmentation algorithms is shown below.

Difference between video (instance-agnostic) semantic and instance segmentation

Video instance segmentation Transformer (VisTR) is a framework built for instance segmentation that views the instance segmentation task as a parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, the VisTR outputs the sequence of masks for each instance in the video directly.

First, given a sequence of video frames, a standard CNN module extracts features of individual image frames. The multiple image features are concatenated in the frame order to form the clip-level feature sequence. Next, the Transformer takes the clip-level feature sequence as input and outputs a sequence of object predictions in order.

the overall architecture of Vi

The sequence of predictions follows the order of input images, and the predictions of each image follow the same instance order. Thus, instance tracking is achieved seamlessly and naturally in the same framework of instance segmentation. An overview of the VisTR model is shown above and some qualitative results obtained are shown below.
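
The ordering convention is what makes tracking implicit, and it can be shown with a simple reshape. The sketch assumes a flat output sequence in frame-major order (all instance slots of frame 0, then frame 1, and so on), as described above. The function name and the placeholder predictions are illustrative.

```python
import numpy as np

def split_into_tracks(pred_sequence, num_frames, num_instances):
    """Regroup a flat, frame-major prediction sequence into instance tracks."""
    preds = np.asarray(pred_sequence)
    # Frame-major layout: (frame, instance slot, ...prediction dims).
    per_frame = preds.reshape(num_frames, num_instances, *preds.shape[1:])
    # Swap axes so that tracks[i] holds instance i across all frames.
    return per_frame.swapaxes(0, 1)
```

With three frames and two instance slots, the flat sequence `[0, 1, 2, 3, 4, 5]` yields the tracks `[0, 2, 4]` and `[1, 3, 5]`, so no separate association step between frames is needed.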

Visualization of VisTR

Video Panoptic Segmentation

Video panoptic segmentation (VPS) assigns every pixel in a video both a semantic class label and, for countable objects, an instance identity that stays consistent across frames, all in a single step. This approach combines the strengths of both instance-agnostic semantic segmentation and video instance segmentation.

The main advantage of VPS is that it covers both countable objects (“things,” such as cars or people) and amorphous background regions (“stuff,” such as road or sky), providing a more detailed understanding of the scene. It also allows us to distinguish and segment multiple instances of the same object class in a video, even when they overlap, although this comes at the cost of high computational demand. This is particularly useful for applications such as video surveillance, autonomous vehicles, and drones.

An example of such a framework is the ViP-DeepLab model that performs Depth-aware Video Panoptic Segmentation (DVPS) as a step toward solving the inverse projection problem (which refers to the ambiguous mapping from the retinal images to the sources of retinal stimulation).

The authors found that video panoptic segmentation can be modeled as concatenated image panoptic segmentation. Motivated by this, they extended the Panoptic-DeepLab model to perform center regression for two consecutive frames with respect to only the object centers appearing in the first frame. During inference, this offset prediction allows ViP-DeepLab to group all the pixels in the two frames to the same object that appears in the first frame. New instances emerge if they are not grouped with the previously detected instances. The schematic workflow of the model looks like this:

schematic workflow of Panoptic-DeepLab model
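
The grouping step described above can be sketched as a nearest-center assignment. The sketch assumes that the network has already predicted, for every pixel, an offset pointing to its object center, and that the centers detected in the first frame are given. Function and variable names are illustrative, and the handling of newly emerging instances is left out.

```python
import numpy as np

def group_pixels(offsets, centers):
    """Assign each pixel to the center that its predicted offset points to.

    offsets: (H, W, 2) predicted (dy, dx) from each pixel to its object center.
    centers: (K, 2) object centers (row, col) detected in the first frame.
    """
    height, width = offsets.shape[:2]
    ys, xs = np.mgrid[0:height, 0:width]
    # Location each pixel "votes" for as its object center.
    voted = np.stack([ys, xs], axis=-1) + offsets                # (H, W, 2)
    dists = np.linalg.norm(
        voted[:, :, None, :] - centers[None, None, :, :], axis=-1
    )                                                            # (H, W, K)
    return np.argmin(dists, axis=-1)                             # instance ids
```

Applying the same function to the offsets of the second frame, with the centers of the first frame, gives pixels in both frames the same instance id, which is how identities are carried across the frame pair.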

Some qualitative results obtained by ViP-DeepLab are shown below.

Challenges and Limitations of Video Segmentation

Despite the many benefits and applications of video segmentation, there are also several challenges and limitations that need to be considered. Some of the key challenges and limitations of video segmentation include the following:

  • Variability in video content and quality. This can include variations in lighting, resolution, frame rate, and other factors that can affect the appearance and characteristics of the video. Various methods have been developed over the years for dealing with large variations in object appearance, including multi-scale features, deep learning-based methods, and domain adaptation techniques. Methods for dealing with changing lighting and viewpoints include using color histograms or texture features.
  • Lack of temporal consistency. Videos are a sequence of frames, and the contents of the scene can change significantly from frame to frame. This makes it difficult to maintain consistency in the segmentation across frames. Methods for dealing with temporal consistency include using recurrent neural networks (RNNs), optical flow, or motion features.
  • Occlusions. Occlusions occur when one object blocks the view of another object, making it difficult or impossible to track. There are various methods for dealing with occlusions, including using multiple cameras or sensors, depth sensors, and object re-detection.
  • Complexity of visual scenes. Video segmentation can be challenging due to the complexity of the visual scenes depicted in the video. This can include the presence of multiple objects and events, as well as occlusions, reflections, and other visual distractions that can make it challenging to identify and segment the content of the video.
  • Lack of training data. Supervised approaches for video segmentation require the availability of labeled training data, which can be challenging to obtain for many video datasets. This can limit the effectiveness and generalizability of these approaches.
  • Computational complexity. Video segmentation can be computationally intensive, especially for large or high-resolution video datasets. This poses challenges in performing real-time or online video segmentation or scaling the segmentation process to extensive video collections.
  • Evaluation and benchmarking. Evaluating the performance of video segmentation approaches can be difficult due to the lack of standardized benchmarks and evaluation metrics. This can make it challenging to compare and evaluate different approaches or to determine the best approach for a given video dataset.

Applications of video segmentation 

Video segmentation has varied applications across many different industries.

In today’s social media-driven world, an important application of video segmentation is in video editing, where an AI model can automatically identify and extract specific scenes or actions from a video. This can save editors a lot of time and effort, allowing them to quickly and easily create new videos from existing footage.

Another application is in surveillance, where video segmentation can be used to automatically identify and track specific objects or people in a video. This can be used for security purposes, such as identifying potential threats or detecting suspicious behavior.

Video segmentation is also applicable in the field of sports analysis. By automatically identifying and tracking players and actions within a video, it can be used to analyze and improve player performance and help coaches make strategic decisions.

In entertainment, video segmentation can automatically generate captions and subtitles for videos, making them more accessible to a wider audience.

In transportation, video segmentation can be used to analyze footage from cameras on vehicles. This might help to identify and prevent accidents, as well as monitor driver behavior.

💡 Pro tip: Learn more about 9 Revolutionary AI Applications In Transportation

Data labeling for video segmentation in V7: Short guide

V7 video annotation and tagging features let you perform accurate annotations for semantic, instance, and panoptic segmentation or object detection.

Let’s go through a quick tutorial on annotating your videos with the help of the V7 toolset.

Or, jump right into annotating your videos with V7!

Step 1. Import your video

Once you have successfully signed up for a V7 account, go to the Datasets panel and add a new dataset. You can then drag and drop your video file. In this tutorial, we’re going to create segmentation masks of a swimming stingray. With this kind of footage, it is a good idea to keep a high frame rate. We’ll keep the native FPS.

settings for video file in v7

Keep the default options for the remaining dataset setup steps.

Step 2. Go to the annotation panel and create a new class

Once your video is imported, you can open the video in your dataset. The view will switch to the annotation panel. You can now pick the Auto-Annotate tool (the second from the top). To create generic auto-annotations you need to create a new polygon class. This will give us more flexibility and we’ll be able to outline irregular shapes in the video.

creating a new polygon class in v7

Step 3. Annotate the first frame and adjust the length of the annotation

Now, we can select the stingray using the Auto-Annotate tool. Just drag across the frame to delineate the area with the fish. This AI video segmentation tool will automatically create a polygon mask.

auto-annotating a stingray in v7

When you add the annotation, it appears on the timeline at the bottom of the annotation panel. You can extend it to cover the whole length of your video or just a selected fragment of a scene.

Step 4. Create keyframes by re-annotating different positions

Use the timeline to move to a different point in the video where the stingray's position has changed. Click “Rerun” to create a new keyframe. Readjust the area of the annotation if necessary. This will create a new segmentation mask.

The shape of your mask will morph between keyframes on its own. This means that you only need to annotate several frames of the video, not every single one. This can save you a significant amount of time and makes the task much easier.
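
To see why a few keyframes are enough, here is a minimal sketch of keyframe interpolation for polygon annotations. It assumes plain linear interpolation between two keyframe polygons with the same number of vertices in the same order. How V7 morphs masks internally is not described here, so this only illustrates the general principle.

```python
import numpy as np

def interpolate_polygon(poly_a, poly_b, frame_a, frame_b, frame):
    """Linearly interpolate polygon vertices between two keyframes.

    poly_a, poly_b: (V, 2) vertex arrays at keyframes frame_a and frame_b.
    frame: index of the in-between frame to generate.
    """
    poly_a = np.asarray(poly_a, dtype=float)
    poly_b = np.asarray(poly_b, dtype=float)
    if poly_a.shape != poly_b.shape:
        raise ValueError("keyframe polygons must have the same number of vertices")
    t = (frame - frame_a) / (frame_b - frame_a)   # 0 at frame_a, 1 at frame_b
    return (1.0 - t) * poly_a + t * poly_b
```

The more the object deforms or changes direction between two keyframes, the more keyframes are needed for the interpolated masks to stay accurate.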

Step 5. Complete the annotations and export the results

Once you have annotated the entire video, move it to the "Complete" stage.

moving a file to complete stage in v7

This will allow you to export the annotation masks in different formats for training your machine learning model. Or, if you want to test your video segmentation model first, you can train the model online on the platform. However, in most cases, you will probably use your own framework. If you need annotated training data, you can download your annotations as a JSON file or export segmentation masks.

Here are the PNG masks generated from our annotations:

png masks generated from video annotations in v7

That's it! By following these steps, you can use V7 annotation tools to segment any object in a video, create keyframes, and export the results for training your model.

Final words

Video segmentation is a fundamental task in the field of computer vision and multimedia, used for the analysis and understanding of video content. Various approaches and techniques have been developed for video segmentation, ranging from supervised methods that rely on labeled training data to unsupervised methods that rely on the inherent structure of the video.

Video segmentation has many real-world applications, including content-based video retrieval, summarization, annotation and labeling, indexing and organization, and video analytics and understanding. However, as with any newly developed technology, video segmentation has several challenges and limitations: the variability in video content and quality, the complexity of visual scenes, the lack of training data, and the computational complexity of the task. Evaluating the performance of video segmentation approaches is important for understanding their capabilities and limitations.

Rohit Kundu is a Ph.D. student in the Electrical and Computer Engineering department of the University of California, Riverside. He is a researcher in the Vision-Language domain of AI and has published several papers in top-tier conferences and notable peer-reviewed journals.
