A Practical Guide to Video Recognition [Overview and Tutorial]

Video recognition is a crucial part of computer vision. And yet, truly getting to grips with what it is, and how it works, can be a challenge. We've taken the time to provide an exhaustive guide on video recognition, key use cases, and core challenges that lie ahead.
May 3, 2023

Video recognition is an essential component of computer vision, and has enjoyed a significant surge in popularity in recent years. With the widespread availability of digital cameras, smartphones, and surveillance systems, there has been an explosion of video data in different domains, from entertainment and sports to security and healthcare. As a result, video recognition has become an essential tool for analyzing and extracting insights from large video datasets.

In this article, we’re taking a deep dive into video recognition, covering the fundamentals of computer vision and machine learning techniques that underlie it. We’ll explore various applications of video recognition across different domains and also discuss the challenges of building video recognition systems. Better yet, we’ll tackle how deep learning techniques can help overcome them.

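In this article, we cover:

  • What video recognition is

  • The challenges of video recognition

  • How video recognition works

  • Deep learning approaches to video recognition

  • Applications of video recognition

  • How to label videos in V7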

So, let's get started!

What is Video Recognition?

Video recognition is the process of analyzing and understanding the content of a video stream, typically involving the detection, tracking, and recognition of objects, scenes, and activities. It is an essential component of computer vision, which is concerned with automatically interpreting visual data from the world around us. The primary goal of video recognition is to extract meaningful information from raw video data, converting it into a structured representation that can be used for analysis and decision-making.

Video recognition systems can help to automate tedious tasks and provide valuable insights into complex systems. For example, video recognition can be used to detect and track vehicles, pedestrians, and traffic signs in autonomous driving systems or to detect and classify activities such as falling, walking, or running in healthcare monitoring systems.

Challenges of Video Recognition

While significant progress has been made in recent years, there are still several challenges that must be addressed to build accurate and robust video recognition systems. Some of the key challenges of video recognition are:

High Dimensionality of the Data

Video data is typically high-dimensional, with each frame containing up to millions of pixels. This makes it challenging to process and analyze data efficiently. For example, a 1-minute video clip at 30 frames per second contains 1,800 frames; at a resolution of 1920x1080, that is roughly 3.7 billion pixels to process, before even counting color channels.
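To make the scale concrete, here is a quick back-of-the-envelope calculation. The 1920x1080 resolution and the three color channels are illustrative assumptions, not properties of any particular dataset.

```python
def clip_pixel_count(seconds, fps, width, height):
    """Return (frames, pixels) for an uncompressed clip."""
    frames = seconds * fps
    return frames, frames * width * height


# Assumed example: one minute of Full HD video at 30 frames per second.
frames, pixels = clip_pixel_count(seconds=60, fps=30, width=1920, height=1080)
channel_values = pixels * 3  # assuming three color channels (RGB)

print(f"{frames:,} frames, {pixels:,} pixels, {channel_values:,} channel values")
```

Even a short clip therefore holds billions of raw values, which is why most pipelines downsample frames or sample only a subset of them.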

High Variability

Another significant challenge in video recognition is the variability in appearances. For example, the same object may look different depending on the lighting conditions or camera angle. Similarly, different objects may look similar or even identical, making it challenging to distinguish between them.

Complexity of Object Interactions and Activities

Video recognition involves identifying not only individual objects but also their interactions and activities. For example, recognizing a person walking involves identifying the person in a frame and also detecting their temporal movement and direction. Similarly, recognizing a group of people playing soccer involves identifying the players, the ball, and their movements and interactions. These complex interactions and activities can make it challenging to build accurate and robust video recognition systems.

Limited Availability of Labeled Data

Another challenge in video recognition is the limited availability of data that has been annotated with labels or tags that describe the objects or activities present in the video. Labeled data is essential for training machine learning algorithms to recognize objects and activities in the video. However, labeling video data is a time-consuming and expensive process, which makes it challenging to obtain large amounts of labeled data. To address this challenge, researchers have developed techniques such as semi-supervised learning and active learning, which can help to reduce the amount of labeled data required for training.

Real-Time Performance

Video recognition often has to run in real time, for example in surveillance systems or autonomous vehicles. This means the system must process and analyze video at the rate it arrives, typically 30 frames per second or higher. Achieving this performance can be challenging, particularly for deep learning techniques, which can be computationally intensive.
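One simple way to reason about this requirement is a per-frame time budget. The sketch below is only an illustration: the function names and the latencies passed to it are hypothetical, and it ignores batching, decoding, and I/O overhead.

```python
def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def keeps_up(inference_ms, fps):
    """True if a model with the given per-frame latency can keep pace."""
    return inference_ms <= frame_budget_ms(fps)


print(f"Budget at 30 fps: {frame_budget_ms(30):.1f} ms per frame")
print("25 ms model keeps up:", keeps_up(25.0, 30))
print("40 ms model keeps up:", keeps_up(40.0, 30))
```

At 30 frames per second the budget is about 33 ms per frame, so a model that needs 40 ms per frame would have to skip frames or run on faster hardware.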

How does Video Recognition work?

Video Recognition with deep learning involves training neural networks to automatically learn relevant features from video data and use them to recognize objects and activities. 3D Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are two popular types of neural networks used in video recognition.

Popular Datasets

Before diving into popular datasets, it’s worth noting that we previously discussed some popular video datasets in our video classification article. Let’s explore some other datasets commonly used for performance evaluation.

  • 3D-ZeF: 3D-ZeF is a multi-object tracking dataset consisting of videos of zebrafish for studying neurological disorders, social anxiety, and more. It contains 54 videos of zebrafish swimming in a cubic tank that is 7 cm on each side, with a resolution of 2048x2048 pixels and a frame rate of 100 frames per second.

  • A*3D: A*3D is a large-scale dataset for autonomous driving research in challenging environments. The dataset contains 39,179 LiDAR point clouds, 7 classes, and 230K 3D object annotations.

  • AVE: The Audio-Visual Event (AVE) dataset was developed to tackle the problem of audio-visual event localization in unconstrained videos, in both supervised and weakly supervised settings. AVE contains 4,143 videos covering 28 event categories, and videos in AVE are temporally labeled with audio-visual event boundaries.

  • BDD100K: The BDD100K dataset is one of the largest and most diverse driving video datasets, consisting of 100,000 videos that represent more than 1,000 hours of driving experience and more than 100 million frames. BDD100K is annotated for 10 different perception tasks in autonomous driving.

Deep learning approaches to Video Recognition

Deep learning has revolutionized the field of video recognition by providing powerful tools for automatically detecting and classifying objects and actions in video data. There are several common deep learning approaches used for a variety of video recognition tasks, such as action recognition, video retrieval, and video captioning.

Let's explore some of the most interesting ones in a bit more detail.


3D Convolutional Neural Networks

3D Convolutional Neural Networks (3D-CNNs) are an extension of 2D-CNNs and can process spatiotemporal data in a video. 3D-CNNs can learn to extract features from multiple video frames simultaneously, unlike 2D-CNNs that process frames one at a time, enabling them to capture the temporal dynamics of the video.
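To see what "multiple frames simultaneously" means, here is a minimal sketch of a single 3D convolution written in plain Python. It is deliberately naive (real 3D-CNNs use optimized libraries and learned kernels), and the toy clip and the hand-picked temporal-difference kernel are assumptions made for illustration.

```python
def conv3d_valid(video, kernel):
    """Naive 'valid' 3D cross-correlation over a T x H x W grayscale clip."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel spans time (dt) as well as space (dy, dx).
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: a bright pixel stays put for two frames, then moves right.
clip = [
    [[1, 0], [0, 0]],
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
]
# Hand-picked 2 x 1 x 1 kernel that subtracts each frame from the next one.
temporal_diff = [[[-1]], [[1]]]

response = conv3d_valid(clip, temporal_diff)
print(response[0])  # no motion between frames 0 and 1
print(response[1])  # motion between frames 1 and 2
```

The first output plane is all zeros because nothing moved, while the second one is non-zero exactly where the pixel left and arrived. A 2D kernel applied to one frame at a time cannot produce this kind of motion-sensitive response.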

Compromised Metric via Optimal Transport (CMOT) is an interesting method that uses a 3D-CNN in a Few-Shot Learning framework for action recognition. CMOT simultaneously compares the content differences and ordering differences of two videos to give a compromised measurement under the Optimal Transport (OT) framework, thus balancing semantic and temporal information in videos.

CMOT samples several segments from a video and gets their embeddings using a 3D-CNN to form a sequence of content representations. A semantic cost matrix is computed between their content representations. For example, segments from the same video will have a low cost value, while two segments taken from different videos will have a high cost value.

To preserve the inherent temporal ordering information (which segment occurs after which), CMOT additionally amends the semantic cost matrix by penalizing it with the positional distance between a pair of segments. For example, the last frame for one segment will be very close to the first frame of the next segment when they are sampled sequentially from the same video.

A metric classifier is then built for optimizing the 3D-CNN by calculating the distance between the videos as a transportation cost. Here “distance” can be thought of crudely as the opposite of similarity: highly similar video segments will have low distance values and vice versa. This is the optimal transport framework adopted in CMOT.
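The idea of combining a semantic cost with a positional penalty and then measuring a transport cost can be sketched as follows. This is not the CMOT authors' implementation: the cosine distance, the penalty weight `lam`, the uniform segment weights, and the entropy-regularized Sinkhorn solver are all assumptions chosen to keep the example small, and the segment embeddings are made-up 2D vectors rather than 3D-CNN outputs.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cost_matrix(segs_a, segs_b, lam=0.5):
    """Semantic cost plus a penalty on the relative position of segments."""
    n, m = len(segs_a), len(segs_b)
    cost = []
    for i, a in enumerate(segs_a):
        row = []
        for j, b in enumerate(segs_b):
            positional = abs(i / max(n - 1, 1) - j / max(m - 1, 1))
            row.append(cosine_distance(a, b) + lam * positional)
        cost.append(row)
    return cost


def sinkhorn_distance(cost, eps=0.1, iters=200):
    """Approximate transport cost with uniform weights on both videos."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan entry (i, j) is u[i] * K[i][j] * v[j].
    return sum(u[i] * K[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m))


# Hypothetical segment embeddings for one video, and the same video reversed.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reversed_video = video[::-1]

d_same = sinkhorn_distance(cost_matrix(video, video))
d_reversed = sinkhorn_distance(cost_matrix(video, reversed_video))
print(f"same order: {d_same:.4f}, reversed order: {d_reversed:.4f}")
```

The reversed video contains exactly the same segments, so a purely semantic comparison would see no difference. The positional penalty is what makes the reversed ordering more expensive to match.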

The overview of the CMOT framework is shown below. The authors report state-of-the-art results, with CMOT beating contemporary methods by a fair margin.



Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of deep learning model commonly used for processing sequential data. They can be applied to video recognition tasks by treating each frame of a video as a time step in a sequence and processing the frames sequentially.

Basic structure of an RNN.

The output of the RNN at each time step is typically a set of features that capture the spatial and temporal information in the current frame. These features can then be combined across frames using a pooling operation or a separate classifier to make predictions about the video, such as recognizing specific actions or activities.
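As a rough sketch of this pipeline, the snippet below runs a tiny Elman-style RNN over per-frame feature vectors and mean-pools the hidden states into one video-level vector. The weights and frame features are made-up numbers; in practice the features would come from a CNN and the weights would be learned.

```python
import math


def rnn_step(x, h, W_x, W_h):
    """One recurrent step: h_new = tanh(W_x @ x + W_h @ h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_x[k], x))
            + sum(w * hi for w, hi in zip(W_h[k], h))
        )
        for k in range(len(W_h))
    ]


def encode_video(frame_features, W_x, W_h):
    """Process frames in order, then mean-pool the hidden states."""
    h = [0.0] * len(W_h)
    states = []
    for x in frame_features:
        h = rnn_step(x, h, W_x, W_h)
        states.append(h)
    return [sum(s[k] for s in states) / len(states) for k in range(len(h))]


# Hypothetical fixed weights and per-frame features (one vector per frame).
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

video_vector = encode_video(frames, W_x, W_h)
reversed_vector = encode_video(frames[::-1], W_x, W_h)
print(video_vector)
print(reversed_vector)
```

Because each hidden state depends on the previous one, playing the frames in reverse gives a different video vector, which is how the RNN encodes temporal order.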

StagNet (spatio-temporal attention and semantic graph network) is an RNN architecture to tackle the group activity recognition problem. Group activity recognition focuses on the recognition of activities performed by multiple individuals in a group. The goal of the problem is to automatically detect and recognize the actions of each individual in a video, as well as the group dynamics and interactions that occur during the activity.

StagNet infers individual activities and their spatial relations and represents them by an explicit semantic graph. The temporal interactions are integrated by a structural-RNN model. A spatio-temporal attention mechanism is integrated on top of it to attach various levels of importance to different persons/frames in video sequences. 

That is, the activities of some objects in the video are more salient because they change more rapidly in space or time. Imagine a person sitting: not much changes between frames. Now imagine someone sprinting: there is a lot of temporal change. You can spot both of these people in the same park, so assigning different levels of importance to objects matters when you are trying to detect group activity.

StagNet consists of a two-layer RNN that integrates two kinds of RNN units (i.e., nodeRNN and edgeRNN) into its framework, which is trained end-to-end. In particular, the first part is to construct the semantic graph from input frames, and then the temporal factor is incorporated by using a structural RNN. The inference is achieved via ‘message-passing’ and ‘factor-sharing’ mechanisms. 

In graphical models, nodes in the graph represent random variables, and edges represent the dependencies between them. The basic idea of message passing is to propagate information between nodes in the graph based on the observed data and the model assumptions. Message passing involves passing probability distributions between adjacent nodes in the graph, based on the conditional probabilities defined by the model. This allows the nodes to update their beliefs about the variables, based on the evidence from their neighbors.

Factor sharing is a related concept that involves using the same probability distribution to represent multiple variables in the model. By sharing factors across variables, the model can capture correlations between variables more efficiently, reducing the number of parameters that need to be estimated.

Finally, StagNet adopts a spatiotemporal attention mechanism to detect key persons and frames to improve performance further. That is, the attention mechanism identifies the frames that are most important to the video sequence (the frames where the actual action occurs in a long video, for example), and the bounding boxes of the objects that make these frames so important (the actual persons performing the action, for example).


Some qualitative results obtained by the StagNet architecture are shown below. Here, you’ll see visualization results from a volleyball game, where the StagNet model assesses the positions of players, key individuals, and the movements of the two teams as a whole. The regions in the frames deemed as “important” by the model are shown as a heatmap, and the key individuals identified are shown within bounding boxes.


Siamese Networks

Siamese Networks typically consist of two identical deep neural networks that share weights and are trained to output similar features for pairs of similar videos and dissimilar features for pairs of dissimilar videos.

Basic Architecture of a Siamese Network.

In the case of video recognition, Siamese Networks are often used for tasks that involve detecting similarities or differences between pairs of videos, such as video retrieval, video similarity search, and video verification. In these tasks, the Siamese Network takes, as input, pairs of video clips and outputs a similarity score that indicates how similar the two clips are.
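The weight-sharing idea can be sketched in a few lines: the same encoder, with the same weights, is applied to both clips, and the two embeddings are compared. The linear-plus-ReLU encoder, the fixed weights, and the clip feature vectors below are all hypothetical stand-ins for a trained deep network.

```python
import math


def encode(clip_features, weights):
    """Shared encoder: a linear projection followed by ReLU."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, clip_features)))
        for row in weights
    ]


def cosine_similarity(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def siamese_score(clip_a, clip_b, weights):
    """Both branches use the *same* weights, which is what makes it Siamese."""
    return cosine_similarity(encode(clip_a, weights), encode(clip_b, weights))


# Hypothetical weights and clip-level features.
shared_weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
clip_a = [1.0, 0.2, 0.1]
clip_b = [0.9, 0.25, 0.1]   # similar to clip_a
clip_c = [0.1, 1.0, 0.0]    # different from clip_a

sim_ab = siamese_score(clip_a, clip_b, shared_weights)
sim_ac = siamese_score(clip_a, clip_c, shared_weights)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Training adjusts the shared weights so that scores like `sim_ab` go up for matching pairs and scores like `sim_ac` go down for non-matching ones.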

An example of such a network is COSNet, or the CO-attention Siamese Network, developed for unsupervised video object segmentation. During the training phase, COSNet takes a pair of frames from the same video as input and learns to capture their rich correlations. This is achieved by a differentiable, gated co-attention mechanism (a method to capture interdependencies between the two frames), which enables the network to attend to the components the two frames share and to distinguish the features that don't match.

During testing, COSNet infers the primary target with a global view. In other words, it takes advantage of the co-attention information (interdependent relationships modeled into a joint embedding space capturing context information) between the testing frame and multiple reference frames. COSNet offers a unified, end-to-end trainable framework that efficiently mines rich contextual information within video sequences. Here is an overview of the COSNet architecture.


Here are a few qualitative results obtained by COSNet on three different datasets. These include a dancer, horses, and a bird in the video foreground.



Transformers

Because video datasets are large, processing them frame by frame is exceptionally time-consuming, and RNNs, which must handle frames one after another, reach their limit. Transformers are a class of deep networks that do not operate on the input data sequentially but instead process all the frames in parallel, making them faster and more efficient. This is accomplished by using self-attention mechanisms, which allow the transformer to focus selectively on different parts of the video sequence and attend to the most relevant information.

Video transformers have recently become powerful tools for comprehending video. An example of such a network is the Object-Region Video Transformers (ORViT) model for video understanding through object tracking. ORViT’s primary goal is to explicitly fuse object-centric representations into the spatio-temporal representations of video-transformer architectures and do so throughout the model layers, starting from the earlier layers.

Object-centric representations focus on identifying and localizing individual objects within an image or video (low-level representation), while spatio-temporal representations are high-level features suited to tasks like Action Recognition. Thus, fusing the object-centric representations makes the spatio-temporal features richer.

ORViT achieves this by adapting the self-attention block to incorporate object information. In a self-attention block, each input feature is used to compute attention scores (representing importance of the features for the particular task) with respect to every other input feature, allowing the model to capture long-range dependencies and interactions between features.
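For readers who have not seen it written out, here is a minimal sketch of scaled dot-product self-attention over a handful of tokens. To keep it short, the queries, keys, and values are the tokens themselves (no learned projections), which is a simplifying assumption; it illustrates generic self-attention rather than ORViT's modified block.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Each token attends to every token, including itself."""
    d = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all tokens.
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights


# Three made-up tokens: the first two are similar, the third is different.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, and the first token places more weight on the similar second token than on the dissimilar third one. Because every token is compared with every other token in one step, distant frames can influence each other directly.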

ORViT takes, as input, bounding boxes and patch tokens (also referred to as spatiotemporal representations) and outputs refined patch tokens based on object information. Patch tokens can be thought of as a way of breaking down an image into smaller, more manageable pieces, allowing the model to process the image more efficiently and effectively. By representing each patch as a learned feature vector, a model is able to capture both low-level and high-level visual features of the image, and to reason about the relationships between different patches.
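The first step of building patch tokens, cutting an image into non-overlapping patches and flattening each one, can be sketched as below. The 4x4 image and the patch size of 2 are toy assumptions, and the learned projection that would turn each flattened patch into a feature vector is omitted.

```python
def patchify(image, patch):
    """Split an H x W image into non-overlapping patch x patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for y in range(0, height - patch + 1, patch):
        for x in range(0, width - patch + 1, patch):
            tokens.append(
                [image[y + dy][x + dx] for dy in range(patch) for dx in range(patch)]
            )
    return tokens


# Toy 4 x 4 grayscale image whose pixel values are just their index.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

tokens = patchify(image, patch=2)
print(tokens)  # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

For video, the same idea is applied across frames, so a clip becomes a sequence of spatio-temporal tokens that a transformer can attend over.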

Within the block, the information is processed by two separate object-level streams: an “Object-Region Attention” stream that models appearance and an “Object-Dynamics Module” stream that models trajectories.

The appearance stream first extracts descriptors for each object based on the object coordinates and the patch tokens. Next, the object descriptors are appended to the patch tokens, and self-attention is applied to all these tokens jointly, thus incorporating object information into the patch tokens.


The trajectory stream only uses object coordinates to model the geometry of motion and performs self-attention over those. Finally, both streams are re-integrated into a set of refined patch tokens, which have the same dimensionality as the input to the ORViT block, allowing the block to be called repeatedly. The overview of the ORViT architecture is shown above.

ORViT achieved state-of-the-art performance on compositional and few-shot action recognition and spatio-temporal action detection tasks. The attention learned by the object-region attention module is visualized below.


Other successful Video Transformers are the Video Swin Transformer and the BEVT models, both built for wide-range video recognition tasks.

Applications of Video Recognition

Video Recognition technology is widespread in the modern age. Let’s look at some of its most impactful applications in detail.

Security and surveillance

Video Recognition is widely used in video surveillance systems for detecting and recognizing objects and activities in real time. Video surveillance systems with advanced video recognition capabilities can detect and track suspicious behavior, alert security personnel to potential threats, and provide evidence for criminal investigations.

For example, this paper tackles anomaly detection in surveillance videos using Weakly Supervised Multiple Instance Learning (MIL). The problem is weakly supervised because only video-level labels are available (i.e., it is known only whether a video is normal or contains an anomaly somewhere), while the objective is to find where the anomaly lies temporally (finding specific timestamps within a long video sequence).

MIL is a type of learning framework where the training data consists of groups of instances, called bags, rather than individual instances. In MIL, each bag is labeled with a binary label, indicating whether at least one of the instances in the bag belongs to a positive class (target class) or not. The goal of MIL is to learn a classifier that can accurately predict the label of unseen bags based on the instances they contain. Unlike traditional learning setups, where each instance is labeled individually, in MIL, the labels are assigned to bags, which may contain multiple instances of both positive and negative classes.
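The bag-level view can be sketched with a few small functions. Scoring a bag by its highest-scoring instance is a common MIL convention, and the hinge-style ranking loss with a margin of 1 is one typical way to compare a positive bag with a negative one; both are assumptions for illustration here, not a reproduction of any specific paper's loss.

```python
def bag_label(instance_labels):
    """A bag is positive if at least one of its instances is positive."""
    return int(any(instance_labels))


def bag_score(instance_scores):
    """Score a bag by its most confident instance."""
    return max(instance_scores)


def ranking_hinge_loss(positive_bag_scores, negative_bag_scores, margin=1.0):
    """Penalize a positive bag that does not outscore a negative bag by the margin."""
    gap = bag_score(positive_bag_scores) - bag_score(negative_bag_scores)
    return max(0.0, margin - gap)


# Hypothetical per-segment scores for two videos (bags).
anomalous_video = [0.1, 0.9, 0.2]   # one segment looks anomalous
normal_video = [0.1, 0.2]

print(bag_label([0, 0, 1, 0]))      # 1
print(bag_label([0, 0]))            # 0
print(ranking_hinge_loss(anomalous_video, normal_video))
```

Only bag labels are needed to compute this loss, yet minimizing it pushes up the score of the most suspicious segment in each positive bag, which is how segment-level scores emerge from video-level supervision.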

The authors propose to learn anomalies through a deep MIL framework by treating normal and anomalous surveillance videos as bags and short segments/clips of each video as instances in a bag. Based on training videos, the framework automatically learns a ranking model that predicts high anomaly scores for anomalous segments in a video. The overview of the approach is shown below.


The ROC (Receiver Operating Characteristic) curves (graphs that show the performance of a binary classification model across decision thresholds) obtained by the authors for the binary anomaly detection problem are shown below. The more a ROC curve is shifted to the top-left, the better the classifier.

Pro Tip: To learn more about how and why to compute ROC curves, follow the guide to confusion matrix.
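For intuition, a ROC curve can be computed by sorting predictions by score and sweeping the decision threshold from high to low. The sketch below does this for six made-up scores and labels and integrates the curve with the trapezoidal rule to get the area under it (AUC).

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all predictions tied at this score are counted.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve via the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up anomaly scores and ground-truth labels (1 = anomalous).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(f"AUC = {auc(curve):.3f}")
```

An AUC of 1.0 corresponds to a curve that hugs the top-left corner, while 0.5 corresponds to random guessing.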

Autonomous Driving

Video recognition is essential to autonomous vehicles, allowing them to successfully perceive and navigate their environment. Video recognition systems in autonomous vehicles can detect and track other vehicles, pedestrians, and obstacles on the road and make decisions based on the detected information.

Monocular depth estimation refers to estimating the depth of different objects in a scene using a single camera input. This is an important task in autonomous driving systems, which need to understand which car, pedestrian, or traffic signal is closest to the autonomous car. One exciting approach was proposed in this paper, where the authors developed a Self-Supervised Learning model for depth estimation.

Here, the authors use a deep neural network architecture to predict the depth map and camera pose jointly from a single image. Along with photometric consistency (consistency between different views of the same scene), they introduced a scale-consistent geometric loss function that enforces the consistency between the predicted depth map and the estimated camera motion.

The loss function comprises both a depth reconstruction loss and a forward-backward relative pose error to get more accurate results. The forward-backward relative pose computes the loss between object poses (geometric representations) in consecutive frames. This is effective, since two consecutive frames of the same video will have little change (at 30 fps, 30 photos are taken in one second), and capturing this relationship will help to predict future frames.

The overview of their workflow is shown below.


Results in the form of heatmaps (yellower shades mean that the region is closer) obtained by the authors are shown below.


Customer behavior analytics in retail

Video recognition can also be used in retail analytics to analyze customer behavior and preferences. Video recognition systems can detect and track customers’ movements in stores through security cameras, analyze their behavior, and provide insights into their preferences and purchase intent. This enables the efficient auditing of product placements in stores, leading to higher sales revenue.

Pro Tip: To learn more about how AI is revolutionizing retail purchases, check out our article on the future of AI and retail.

For example, Top-View Open World (TVOW) is a framework that performs Open World re-identification of people in top-view video data. Person re-identification (re-ID) is a problem that involves recognizing a person at different locations and times, involving different camera views, poses, and lighting.


TVOW uses a pre-trained deep convolutional neural network (CNN) fine-tuned on a dataset acquired via a top-view configuration (videos taken from above, for example by a security camera installed in the ceiling), as shown in the image above.

The network is trained using Contrastive Learning with a triplet loss to optimize the embedding space such that data points with matching identities are closer to each other than those with different identities.

Simply put, triplet loss works like this: for every image, called an anchor, we also pick a “positive” image, i.e., an image that belongs to the same class as the anchor, and a “negative” image, i.e., one that does not belong to the anchor image’s class. The loss then pulls the anchor and positive embeddings together and pushes the anchor and negative embeddings apart. To learn more, see this article, where we tackle the A to Z of triplet loss.
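In code, the standard form of the triplet loss is only a couple of lines. The Euclidean distance, the margin of 0.2, and the 2D embeddings below are illustrative assumptions; TVOW's exact settings are not given here.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Made-up embeddings of three person crops.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]        # same identity, close to the anchor
easy_negative = [1.0, 0.0]   # different identity, already far away
hard_negative = [0.2, 0.0]   # different identity, still too close

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative yields zero loss, so it contributes no gradient, while the hard negative yields a positive loss that pushes its embedding away from the anchor during training.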


The TVOW approach obtains good results (over 90% in all cases), as shown below by the Cumulative Matching Curves (CMC) for closed-set re-ID at a fixed false accept rate. The CMC curve plots the cumulative percentage of correct matches on the y-axis against the rank of the retrieved image on the x-axis. In other words, it shows how often a correct match for a query image appears among the top N-ranked images returned by the system.
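A CMC curve is straightforward to compute once each query's gallery results are ranked. In the sketch below, each query is represented by a list of booleans marking which ranked results are correct matches; the four queries are made up for illustration.

```python
def cmc_curve(ranked_matches, max_rank):
    """Fraction of queries with a correct match within the top r results."""
    hits = [0] * max_rank
    for matches in ranked_matches:
        # Position of the first correct match, if there is one in range.
        first = next(
            (rank for rank, ok in enumerate(matches[:max_rank]) if ok), None
        )
        if first is not None:
            for rank in range(first, max_rank):
                hits[rank] += 1
    return [h / len(ranked_matches) for h in hits]


# Four hypothetical queries; True marks a correct identity in the ranking.
queries = [
    [True, False, False],    # correct at rank 1
    [False, True, False],    # correct at rank 2
    [False, False, True],    # correct at rank 3
    [False, False, False],   # never matched
]

print(cmc_curve(queries, max_rank=3))  # [0.25, 0.5, 0.75]
```

The curve can only stay flat or rise as the rank increases, and its first value is the familiar rank-1 accuracy.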


Traffic monitoring

Video recognition can aid traffic monitoring by providing real-time analysis and understanding of traffic flow. It can estimate traffic volume, analyze traffic flow, detect and classify vehicles, read license plates, detect pedestrians, and alert authorities to incidents. With advanced deep learning techniques, video recognition can become increasingly accurate and efficient in recognizing various traffic-related events and conditions. It has the potential to revolutionize traffic monitoring and improve road safety and efficiency.

Pro Tip: Learn more about how AI is stirring up the transport industry here: 9 Revolutionary AI Applications In Transportation.

License plate recognition is an important problem in this context. It aims to automatically read and identify license plates on vehicles, for example to identify speeding vehicles or vehicles involved in an accident.

Automated License Plate Recognition using OCR

V-LPDR is an example of a system that unifies license plate detection, tracking, and recognition into one computational framework via deep learning. A deep network captures the spatiotemporal features in V-LPDR for the detection task by aggregating the optical flow maps of adjacent frames. Then a multi-task CNN model is used to bridge video-based detection and recognition, which utilizes motion and deep appearance information.

Finally, high-quality frames are recommended by a neural network system, and these frames are then used to determine the license plate number using an autoencoder. This saves computational costs while also maintaining accuracy. The overview of the V-LPDR approach is shown below.

Some of the qualitative results obtained by V-LPDR are shown here. The algorithm can predict tight bounding boxes and track them accurately across frames.

Some examples of results obtained by V-LPDR.

How to Label Videos in V7?

Many computer vision problems related to video recognition can be solved efficiently with cloud-based services such as Google Cloud Vision API or Amazon Rekognition. These services provide excellent results for general video recognition tasks, including sentiment analysis, motion tracking, and landmark recognition.

However, if you need a custom solution tailored to your specific use case, training your own model on your own data is the way to go. In this tutorial, we'll demonstrate how to use the V7 AI platform for video annotation and model training.

Step 1: Determine Your Model's Inputs and Outputs

Before starting, identify the types of data annotation required for your computer vision task.

| Type of video footage | Task | Example outputs |
| --- | --- | --- |
| Traffic monitoring camera footage | Object tracking | A series of bounding boxes with their XY coordinates |
| Live cell time lapse | Movement pattern analysis | A series of keypoint coordinates for cell movement |
| Security camera footage | Face detection and recognition | Bounding boxes with unique IDs |
| Sports footage | Player detection and action classification | Polygons/bounding boxes with attributes |

Select the model type and label class structure that best suits your needs.

For instance, object detection models for video recognition often employ bounding boxes.

However, you can also use polygon annotations for training and box annotations for model predictions. In our example, we'll use bounding boxes with directional vectors representing the players' head positions.

Step 2: Upload Your Data to V7

To begin, collect a diverse set of sports videos that encompass the scenarios your model should handle. This may include various players, teams, camera angles, and lighting conditions, ensuring your model's ability to generalize to new data.

After creating a V7 account, set up a new dataset and name it. Drag and drop your training data videos and select an appropriate frame rate.

In some cases, extracting and preprocessing frames at a lower FPS rate from a larger number of videos may be more efficient. When prompted to add classes and choose a workflow type, use the default settings.

V7 uses workflows, which means each file in your dataset must pass through a sequence of steps, including:

  • Dataset: video files that are uploaded and ready for labeling

  • Annotate: labeling the data (with Auto-Annotate tools or manually)

  • Review: accepting or rejecting the labels

  • Complete: annotated videos ready for model training

Step 3: Label Your Video Dataset

Open one of the uploaded videos in your dataset and choose the bounding box annotation tool from the annotation panel.

Label individual objects, indicating their current state with an attribute or by making them a separate class of the same type. In our example, we label players with bounding boxes and directional vectors for capturing their head positions.

Refine labels and morph them across multiple frames using the timeline in the annotation panel to select keyframes. Annotation changes will interpolate between keyframes automatically.
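To illustrate what interpolating between keyframes means, here is a minimal sketch that linearly interpolates a bounding box between two labeled frames. Linear interpolation and the `(x, y, width, height)` box format are assumptions made for the example; this is not a description of how V7 implements the feature internally.

```python
def interpolate_box(keyframes, frame):
    """keyframes maps a frame index to a box (x, y, width, height)."""
    indices = sorted(keyframes)
    # Outside the labeled range, hold the nearest keyframe.
    if frame <= indices[0]:
        return keyframes[indices[0]]
    if frame >= indices[-1]:
        return keyframes[indices[-1]]
    for start, end in zip(indices, indices[1:]):
        if start <= frame <= end:
            t = (frame - start) / (end - start)
            a, b = keyframes[start], keyframes[end]
            return tuple(av + t * (bv - av) for av, bv in zip(a, b))


# Hypothetical keyframes: the player moves right and the box grows taller.
keyframes = {0: (10, 20, 50, 80), 10: (30, 20, 50, 100)}

print(interpolate_box(keyframes, 5))  # (20.0, 20.0, 50.0, 90.0)
```

With this kind of interpolation, two keyframes are enough to label all the frames between them, and you only add more keyframes where the motion is not smooth.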

Alternatively, if you wish to bypass manual labeling entirely, replace the "Annotate" stage with a new "Model" stage and connect a computer vision model of your choice. You can register an external model, integrate HuggingFace models, or use one of the Public Models available.

Step 4: Review Annotations

The review process is a crucial component of building a reliable video recognition model. Ensuring the quality of your annotations significantly impacts the model's performance.

The Review Stage helps in identifying any errors or inconsistencies in the annotations, enabling you to rectify them before proceeding with model training.

In a team setting, it's a good practice to assign different members to the annotation and review tasks. This separation of responsibilities helps in maintaining an unbiased review process, allowing team members to spot errors or inconsistencies that the annotator might have missed.

Reviewing a large dataset can be time-consuming, especially with video annotation. It might not be practical to review every single data point. 

Instead, if you imported your video as individual frames, you can connect a sampling stage to your workflow and review only a portion of the annotations. This approach ensures that you maintain a high level of quality control while optimizing the time spent on reviewing.
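The idea behind sampling can be sketched outside the platform as well. The snippet below picks a reproducible random subset of frame IDs for review; the function name, the 10% fraction, and the frame IDs are hypothetical, and the platform's own sampling stage may work differently.

```python
import random


def sample_for_review(item_ids, fraction, seed=0):
    """Pick a reproducible random subset of items to review."""
    rng = random.Random(seed)
    count = max(1, round(len(item_ids) * fraction))
    return sorted(rng.sample(item_ids, count))


# Hypothetical dataset of 100 annotated frames, reviewing 10% of them.
frame_ids = list(range(100))
to_review = sample_for_review(frame_ids, fraction=0.1)
print(to_review)
```

Fixing the seed makes the selection repeatable, which is handy when several reviewers need to look at the same subset.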

Step 5: Train Your Video Recognition Model

With an annotated dataset in hand, you can now train a video object detection model. There are three scenarios:

Scenario A: Use V7's Models panel to train models in the cloud. This option is ideal for quickly creating a basic video recognition model that can be trained and deployed without external tools.

Scenario B: Use the same panel to register an external model of your choice. You’ll be able to connect your training data and use the Bring Your Own Model feature via the REST API.

Scenario C: Export annotations as JSON files and employ your preferred ML architecture. This approach, though more complex, allows you to fully utilize information like directional vectors, attributes, and any additional data available in Darwin JSON files.

The first option uses a visual interface and you can complete all steps in a matter of minutes. The other options require familiarity with the Darwin JSON file schema.

Step 6: Evaluate Your Model's Performance

Developing AI solutions is an iterative process. Continuously train multiple versions of your model, improving its performance by feeding it more data with high-quality annotations.

Here are two models connected to a Consensus Stage. This feature is useful for comparing the level of overlap between the output of multiple models.
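A common way to quantify the overlap between two bounding-box predictions is Intersection over Union (IoU). The sketch below computes it for boxes given as `(x_min, y_min, x_max, y_max)`; whether the Consensus Stage uses exactly this metric is an assumption, and the example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical predictions for the same player from two different models.
model_a_box = (0, 0, 10, 10)
model_b_box = (5, 0, 15, 10)

print(f"IoU = {iou(model_a_box, model_b_box):.3f}")
```

An IoU of 1.0 means the two models agree perfectly on a box, while values near 0 flag objects where the models disagree and a human review is most useful.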

To better understand the V7 platform and build your custom video recognition solution, we recommend exploring the documentation and feature pages in more detail.

If you're interested in experiencing the platform firsthand, consider booking a demo to learn how it can help you create advanced video recognition models tailored to your specific needs.

Final Words

In conclusion, video recognition is a rapidly evolving field with numerous applications across various industries. With rapid advancements in deep learning techniques and the availability of large-scale datasets, video recognition has made significant strides in recent years, with models reported to match or even exceed human-level performance on certain benchmark tasks.

In autonomous driving, video recognition plays a crucial role in enabling vehicles to navigate and interact with their environment safely and efficiently. In traffic monitoring, video recognition provides real-time analysis and understanding of traffic flow, improving road safety and efficiency.

While there are still challenges to overcome, such as variability and unpredictability in data, deep learning techniques have shown great potential in addressing these challenges and improving the accuracy and efficiency of video recognition systems.

As video recognition continues to advance, it has the potential to revolutionize a wide range of industries, from transportation to security to entertainment, improving our daily lives in countless ways.

Rohit Kundu is a Ph.D. student in the Electrical and Computer Engineering department of the University of California, Riverside. He is a researcher in the Vision-Language domain of AI and has published several papers in top-tier conferences and notable peer-reviewed journals.
