Pose estimation is a computer vision task that is usually tackled through deep learning.
It is one of the most interesting areas of research that has gained a lot of traction because of its usefulness and versatility—it finds applications in a wide range of fields including gaming, healthcare, AR, and sports.
This article will give you a comprehensive overview of what human pose estimation is and how it works. We promise to keep things simple 😉
It will also cover different HPE approaches—both classical and deep learning-based methods, metrics and evaluation techniques, and much more.
After reading this article, you'll learn:
And if you prefer to get hands-on experience annotating data for your Human Pose Estimation projects, make sure to check out the video below.
Now, let's dive in.
Human Pose Estimation (HPE) is a way of identifying and classifying the joints in the human body.
Essentially, it captures a set of coordinates for each joint (arm, head, torso, etc.). Each of these coordinates is known as a key point, and together the key points describe the pose of a person. The connection between two key points is known as a pair.
The connection formed between the points has to be significant, which means not all points can form a pair. From the outset, the aim of HPE is to form a skeleton-like representation of a human body and then process it further for task-specific applications.
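To make the idea of key points and pairs concrete, here is a minimal sketch of a skeleton-like representation. The joint names, coordinates, and the `skeleton_segments` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical key points: joint name -> (x, y) pixel coordinates.
keypoints = {
    "head": (120.0, 40.0),
    "neck": (120.0, 80.0),
    "right_shoulder": (95.0, 90.0),
    "right_elbow": (85.0, 135.0),
    "left_shoulder": (145.0, 90.0),
    "left_elbow": (155.0, 135.0),
}

# Only meaningful connections are listed as pairs;
# for example, the head is never paired directly with an elbow.
pairs = [
    ("head", "neck"),
    ("neck", "right_shoulder"),
    ("right_shoulder", "right_elbow"),
    ("neck", "left_shoulder"),
    ("left_shoulder", "left_elbow"),
]


def skeleton_segments(keypoints, pairs):
    """Return the (start, end) line segments that make up the skeleton."""
    return [(keypoints[a], keypoints[b]) for a, b in pairs]


segments = skeleton_segments(keypoints, pairs)
```

Drawing each segment on top of the image gives the familiar stick-figure view of a pose.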
There are three types of approaches to model the human body:
HPE belongs primarily to the field of computer vision, where it is used to understand geometric and motion information of the human body, which can be very intricate.
This section explores the two approaches to HPE: the classical approach and the deep learning-based approach. We will also explain how classical approaches fail to capture the geometric and motion information of the human body, and how deep learning algorithms such as CNNs excel at it.
Classical approaches usually refer to techniques and methods involving shallow machine learning algorithms.
For instance, the earlier work to estimate human pose included the implementation of random forest within a “pictorial structure framework”. This was used to predict joints in the human body.
The pictorial structure framework (PSF) is commonly referred to as one of the traditional methods to estimate human pose. PSF contained two components:
In essence, the PSF objective is to represent the human body as a collection of coordinates for each body part in a given input image. PSF uses nonlinear joint regressors, ideally a two-layered random forest regressor.
These models work well when the input image has clear and visible limbs; however, they fail to capture and model limbs that are hidden or not visible from a certain angle.
To overcome these issues, feature building methods like histogram of oriented gradients (HOG), contours, histograms, etc., were used. In spite of using these methods, the classical model lacked accuracy, correlation, and generalization capabilities, so adopting a better approach was just a matter of time.
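To give a feel for what a handcrafted feature looks like, here is a simplified sketch of the core idea behind HOG: a magnitude-weighted histogram of gradient orientations for a single cell. A full HOG descriptor also normalizes histograms over blocks of cells, which is left out here. The function name and the toy patch are assumptions made for illustration.

```python
import math


def orientation_histogram(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    hist = [0.0] * bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]  # horizontal central difference
            gy = patch[y + 1][x] - patch[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // (180.0 / bins)) % bins] += magnitude
    return hist


# Toy grayscale patch with a vertical edge: intensity changes only along x.
patch = [[0, 0, 10, 10]] * 4
hist = orientation_histogram(patch)
```

For this patch, all of the gradient energy falls into the first orientation bin, which is how such a feature signals a vertical edge to a downstream classifier or regressor.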
Deep learning-based approaches are characterized by their ability to approximate almost any function, provided the hidden layers contain a sufficient number of nodes.
When it comes to computer vision tasks, deep convolutional neural networks (CNNs) outperform most other algorithms, and this is true in HPE as well.
CNNs can extract patterns and representations from an input image with more precision and accuracy than most other algorithms, which makes them very useful for tasks such as classification, detection, and segmentation.
Unlike the classical approach, where the features were handcrafted, a CNN can learn complex features when provided with enough training data.
In 2014, Toshev et al. used a CNN to estimate human pose, switching from the classical approach to the deep learning-based approach. They presented the method in the paper DeepPose: Human Pose Estimation via Deep Neural Networks.
In that paper, they framed the whole problem as a CNN-based regression towards body joint coordinates.
The authors also proposed a cascade of such regressors in order to get more precise and consistent results. They argued that the proposed Deep Neural Network can model the given data in a holistic fashion, i.e. the network has the capability to model hidden poses, which was not true for the classical approach.
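Treating pose estimation as regression means the network outputs joint coordinates directly and is trained with a distance-based loss. Here is a minimal sketch of that idea, assuming joints are normalized relative to a person bounding box and compared with an L2 loss. The function names and all numbers are hypothetical, and the "predicted" values simply stand in for a network output.

```python
def normalize_joints(joints, box):
    """Express joints relative to the box center, scaled by box width and height."""
    cx, cy, w, h = box
    return [((x - cx) / w, (y - cy) / h) for x, y in joints]


def denormalize_joints(joints, box):
    """Map normalized joints back to image coordinates."""
    cx, cy, w, h = box
    return [(x * w + cx, y * h + cy) for x, y in joints]


def l2_loss(predicted, target):
    """Sum of squared coordinate differences over all joints."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(predicted, target))


box = (100.0, 100.0, 50.0, 200.0)            # center x, center y, width, height
true_joints = [(100.0, 50.0), (125.0, 100.0)]
target = normalize_joints(true_joints, box)
predicted = [(0.0, -0.25), (0.4, 0.0)]       # hypothetical network output
loss = l2_loss(predicted, target)
```

In a cascade, each later stage would take a crop around the current estimate and regress a correction, so the loss above would be applied once per stage.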
With the strong and promising results shown by DeepPose, HPE research naturally gravitated towards deep learning-based approaches.
As research and development in HPE started to take off, new challenges emerged.
One of them was multi-person pose estimation.
DNNs are very proficient at estimating the pose of a single person, but they struggle when it comes to estimating the poses of multiple people because:
In order to tackle these problems, the researchers introduced two approaches:
Now, let's have a look at deep learning models that are used for multi-human pose estimation.
OpenPose was proposed by Zhe Cao et al. in 2019.
It is a bottom-up approach where the network first detects the body parts or key points in the image, followed by mapping appropriate key points to form pairs.
OpenPose also uses a CNN as its main architecture. It consists of a VGG-19 convolutional network that is used to extract patterns and representations from the given input. The output from the VGG-19 goes into two branches of convolutional networks.
The first branch predicts a set of confidence maps, one for each body part, while the second branch predicts Part Affinity Fields (PAFs), which encode the degree of association between parts. The PAFs are also useful for pruning the weaker links in the bipartite graphs.
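To see how a Part Affinity Field can express a degree of association, here is a minimal sketch that scores a candidate connection by sampling the field along the line between two key points and measuring how well it aligns with that line. The `paf_score` function, the nearest-pixel sampling, and the toy field are assumptions made for illustration, not the OpenPose implementation.

```python
import math


def paf_score(paf, start, end, samples=10):
    """Average alignment between the field and the segment from start to end.

    paf[y][x] holds a 2D vector (vx, vy); samples must be at least 2.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length        # unit vector along the candidate limb
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x = int(round(start[0] + t * dx))    # nearest pixel on the segment
        y = int(round(start[1] + t * dy))
        vx, vy = paf[y][x]
        total += vx * ux + vy * uy           # dot product with the field
    return total / samples


# Toy 5x5 field: row 2 holds unit vectors pointing right (a horizontal limb).
paf = [[(1.0, 0.0) if y == 2 else (0.0, 0.0) for x in range(5)] for y in range(5)]
aligned = paf_score(paf, (0, 2), (4, 2))     # both candidates lie on the limb
crossing = paf_score(paf, (2, 0), (2, 4))    # candidates cut across the limb
```

A high score suggests the two key points belong to the same limb, while low-scoring candidate links are the ones that get pruned during matching.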
The image above shows the architecture of OpenPose, which is a multi-stage CNN.
Essentially the predictions from the two branches, along with the features, are concatenated for the next stage to form a human skeleton depending upon the number of humans present in the input. Successive stages of CNNs are used to refine the prediction.
The image above describes the overall pipeline of OpenPose.
Regional Multi-person Pose Estimation (RMPE) or AlphaPose implements a top-down approach to HPE.
The top-down approach to HPE is prone to localization errors and inaccurate predictions, which makes it quite challenging.
For instance, the image above shows two bounding boxes: the red box represents the ground truth, while the yellow box represents the predicted bounding box.
When it comes to classification, the yellow bounding box will be considered a “correct” bounding box for a human. However, the human pose cannot be estimated even with this “correct” bounding box.
The authors of AlphaPose tackled this issue of imperfect human detection with a two-step framework. In this framework, they introduced two networks:
The objective of AlphaPose is to extract a high-quality single-person region from an inaccurate bounding box by attaching a Symmetric Spatial Transformer Network (SSTN) to the Single-Person Pose Estimator (SPPE). This makes pose estimation less sensitive to bounding box errors while providing a stable framework to estimate human pose.
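The "symmetric" part of the idea is that a region is transformed before pose estimation and the estimated key points are mapped back afterwards with the inverse transform. Here is a minimal sketch of that round trip, using a fixed scale-and-offset crop in place of the learned transform. All function names and numbers are hypothetical.

```python
def crop_transform(box, out_size):
    """Scale and offset that map image coordinates inside box to a square crop."""
    x0, y0, x1, y1 = box
    return (out_size / (x1 - x0), out_size / (y1 - y0), x0, y0)


def to_crop(point, transform):
    """Forward mapping: image coordinates to crop coordinates."""
    sx, sy, x0, y0 = transform
    return ((point[0] - x0) * sx, (point[1] - y0) * sy)


def to_image(point, transform):
    """Inverse mapping: crop coordinates back to the original image."""
    sx, sy, x0, y0 = transform
    return (point[0] / sx + x0, point[1] / sy + y0)


box = (50.0, 20.0, 150.0, 220.0)                     # x0, y0, x1, y1 of a detection
transform = crop_transform(box, out_size=100.0)
joint_in_crop = to_crop((100.0, 120.0), transform)   # where the pose estimator works
joint_in_image = to_image(joint_in_crop, transform)  # mapped back for the final output
```

In the real framework the transform parameters are predicted by a network, so the crop can shift and rescale to better center the person before the pose estimator runs.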
DeepCut was proposed by Leonid Pishchulin et al. in 2016 with the objective of jointly solving the tasks of detection and pose estimation.
It is a bottom-up approach to estimate human pose.
The idea was to detect all possible body parts in the given image, label each of them (head, hands, legs, etc.), and then separate the body parts belonging to each person.
The network uses Integer Linear Programming (ILP) modeling to implicitly group all the detected key points in the given input such that the resulting output resembles a skeleton representation of the human.
Mask R-CNN is a very popular algorithm for instance segmentation.
The model has the capability to simultaneously localize and classify objects by creating a bounding box around the object and also by creating a segmentation mask.
The basic architecture can be easily extended for Human Pose Estimation tasks.
Mask R-CNN, which builds on Faster R-CNN, uses a CNN to extract features and representations from the given input.
The extracted features are then used to propose where the object might be present through a Region Proposal Network (RPN).
Since the bounding boxes can be of various sizes, as in the image above, a layer called RoIAlign is used to normalize the extracted features so that they are all of uniform size.
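Here is a minimal sketch of how regions of different sizes can be pooled to the same output size with bilinear sampling, which is the core idea behind RoIAlign. This simplified version takes a single sample at the center of each output bin, whereas practical implementations typically average several samples per bin. The function names and the toy feature map are assumptions.

```python
import math


def bilinear(feature, x, y):
    """Bilinearly interpolate a 2D feature map at a fractional (x, y) location."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1 = min(x0 + 1, len(feature[0]) - 1)
    y1 = min(y0 + 1, len(feature) - 1)
    wx, wy = x - x0, y - y0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_align(feature, roi, out_size=2):
    """Pool a region of interest to out_size x out_size, one sample per bin center."""
    x_start, y_start, x_end, y_end = roi
    bin_w = (x_end - x_start) / out_size
    bin_h = (y_end - y_start) / out_size
    return [[bilinear(feature,
                      x_start + (j + 0.5) * bin_w,
                      y_start + (i + 0.5) * bin_h)
             for j in range(out_size)]
            for i in range(out_size)]


# Feature map whose value equals its x coordinate, so results are easy to check.
feature = [[float(x) for x in range(6)] for _ in range(6)]
small = roi_align(feature, (0.0, 0.0, 2.0, 2.0))   # a 2x2 region
large = roi_align(feature, (1.0, 1.0, 5.0, 4.0))   # a 4x3 region
```

Both regions come out as 2x2 grids, and because the sampling points are fractional, no coordinates are rounded to the nearest cell along the way.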
The extracted features are passed into the parallel branches of the network to refine the proposed region of interest (RoI) to generate bounding boxes and the segmentation masks.
When it comes to human pose estimation, the mask segmentation output yielded by the network can be used to detect humans in the given input. Because mask segmentation is very precise in object detection (in this case, human detection), the human pose can be estimated quite easily.
This method resembles the top-down approach, where the person detection stage is performed in parallel to the part detection stage.
In other words, the keypoint detection stage and person detection stage are independent of each other.
Human pose estimation has a variety of real-life applications, so let's have a look at some of the most common HPE use cases.
Maintaining physical well-being has become an integral part of our life these days and having a good trainer can help reach our desired fitness level.
It's no surprise that the market has become saturated with apps that harness the power of AI to help people work out better.
For instance, Zenia is an AI-powered yoga app that uses HPE to guide you towards achieving a proper posture during your yoga workouts. It uses the camera to detect your pose and estimates how accurate your pose is—if it is correct, then the predicted pose will be represented in green, just like in the image above. If the pose isn't correct, the red color will replace the green one.
Apart from yoga, HPE has also found application in other forms of exercise.
For example, it is now commonly used in weight lifting, where it can guide app users to perform a proper weight-lift by searching for common mistakes and providing insights on how to fix them to prevent injuries.
Robotics has been one of the fastest-growing areas of development.
While programming a robot to follow a procedure can be tedious and time-consuming, deep learning approaches can come to the rescue.
Techniques such as reinforcement learning use a simulated environment to achieve the accuracy level required to perform a certain task and can be successfully used to train a robot.
Another interesting application of HPE can be CGI.
The entertainment sector, specifically, the cinema business, spends tons of cash to create computer-generated graphics for special effects, mysterious creatures, out-of-the-world sceneries, and a lot more.
CGI is expensive because it requires a lot of effort, like wearing special suits and masks to capture the motions and creating artificial effects on the estimated pose, as well as processing power and large time investments.
HPE can automatically extract key points from 2D input and create a 3D rendering of the same, which can then be used to add effects, animations, and more.
These days, almost all sports rely heavily on data analysis.
Pose detection can help players to improve their technique and achieve better results. Apart from that, pose detection can be used to analyze and learn about the strengths and weaknesses of the opponent, which is invaluable for professional athletes and their trainers.
Another interesting application of pose estimation comes down to in-game applications, where players can make use of the motion capturing capabilities of HPE to inject poses into the gaming environment. The goal is to create an interactive gaming experience.
For example, Microsoft’s Kinect uses 3D pose estimation (using IR sensor data) to track the motion of the players and to use it to render the actions of the characters virtually into the gaming environment.
HPE can also be used for the analysis of infant motion. This is very helpful for analyzing the behavior of the baby as it grows, especially in assessing the course of its physical development.
In some situations, infants are born with serious health issues related to the muscles, joints, and nervous system, some of which are caused by cerebral palsy, movement disorders, or traumatic injuries.
Motion analysis can help identify which muscles or joints are not working properly. Pose estimation can pick out subtle anomalies in the movement of the infant, which the doctors can analyze and come up with a suitable treatment. HPE can also be used as a recommendation tool to improve physical abilities so that the child can have the greatest level of independence.
Deep learning algorithms need proper evaluation metrics to learn the distribution well during training and to perform well during inference. The right evaluation metrics depend upon the task at hand.
In this section, we will briefly discuss the four evaluation metrics required for HPE.
PCP (Percentage of Correct Parts) is used to measure the correct detection of limbs. A limb is considered detected if the distance between each of its two predicted joint locations and the corresponding true joint location is at most half of the limb length. However, PCP sometimes penalizes shorter limbs, for example, a lower arm, because their threshold is smaller.
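Here is a minimal sketch of PCP that follows the description above. The function names and the example coordinates are made up for illustration.

```python
import math


def limb_detected(pred_a, pred_b, true_a, true_b, fraction=0.5):
    """A limb counts as detected if both endpoints are within fraction * limb length."""
    limit = fraction * math.dist(true_a, true_b)
    return math.dist(pred_a, true_a) <= limit and math.dist(pred_b, true_b) <= limit


def pcp(pred_limbs, true_limbs):
    """Fraction of limbs that are detected."""
    hits = sum(limb_detected(pa, pb, ta, tb)
               for (pa, pb), (ta, tb) in zip(pred_limbs, true_limbs))
    return hits / len(true_limbs)


# A long limb (length 40) and a short limb (length 10), both predicted 6 pixels off.
true_limbs = [((0, 0), (0, 40)), ((0, 40), (0, 50))]
pred_limbs = [((0, 6), (0, 46)), ((0, 46), (0, 56))]
score = pcp(pred_limbs, true_limbs)
```

Both limbs are predicted with the same 6-pixel error, yet only the long one counts as detected, which illustrates why PCP is harsher on short limbs.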
In order to fix the issue raised by PCP, a new metric was proposed: the percentage of detected joints (PDJ). A joint is considered detected if the distance between the predicted and the true joint is within a certain fraction of the torso diameter.
PDJ helps to achieve localization precision, which alleviates the drawback of PCP since the detection criteria for all joints are based on the same distance threshold.
PCK (Percentage of Correct Keypoints) is an accuracy metric that measures whether the predicted keypoint and the true joint are within a certain distance threshold. The threshold is usually set with respect to the scale of the subject, which is enclosed within the bounding box.
The threshold can either be:
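To make PCK concrete, here is a minimal sketch in which the threshold is a fraction `alpha` of a reference scale. Two common choices of scale are the size of the person bounding box and the length of the head segment (the latter variant is often written PCKh). The function name, coordinates, and sizes below are illustrative assumptions.

```python
import math


def pck(pred, true, scale, alpha):
    """Fraction of keypoints whose error is at most alpha * scale."""
    limit = alpha * scale
    hits = sum(math.dist(p, t) <= limit for p, t in zip(pred, true))
    return hits / len(true)


true = [(10, 10), (50, 50), (90, 90)]
pred = [(10, 20), (50, 70), (90, 140)]   # errors of 10, 20, and 50 pixels

bbox_w, bbox_h = 100, 200                # hypothetical person bounding box
pck_box = pck(pred, true, scale=max(bbox_w, bbox_h), alpha=0.2)

head_length = 30                         # hypothetical head segment length
pck_head = pck(pred, true, scale=head_length, alpha=0.5)
```

The same predictions score differently depending on the reference scale, so it is worth stating which variant and which `alpha` were used whenever PCK numbers are reported.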
OKS (Object Keypoint Similarity) is commonly used in the COCO keypoint challenge as an evaluation metric. It is defined as:
OKS takes values between 0 and 1 and shows how close the predicted keypoints are to the true keypoints: the closer the value is to 1, the better the match.
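In the COCO formulation, each labeled keypoint contributes exp(-d_i^2 / (2 * s^2 * k_i^2)), where d_i is the distance between the predicted and the true keypoint, s^2 is the object area, and k_i is a per-keypoint constant. OKS is the average of these terms over the labeled keypoints. Here is a minimal sketch of that computation. The function name is an assumption, and the `k` values are placeholders rather than the official COCO constants.

```python
import math


def oks(pred, true, visibility, area, k):
    """Object keypoint similarity averaged over labeled keypoints (visibility > 0)."""
    total, labeled = 0.0, 0
    for (px, py), (tx, ty), v, k_i in zip(pred, true, visibility, k):
        if v > 0:
            d2 = (px - tx) ** 2 + (py - ty) ** 2
            total += math.exp(-d2 / (2 * area * k_i ** 2))
            labeled += 1
    return total / labeled if labeled else 0.0


true = [(10, 10), (20, 20), (30, 30)]
visibility = [2, 1, 0]            # the third keypoint is unlabeled and is ignored
area = 400.0                      # hypothetical object area in pixels
k = [0.05, 0.05, 0.05]            # placeholder per-keypoint constants

perfect = oks(true, true, visibility, area, k)
shifted = oks([(11, 10), (20, 20), (300, 300)], true, visibility, area, k)
```

A perfect prediction gives an OKS of 1, and the score decays smoothly as keypoints drift, faster for small objects and for keypoints with a small `k`.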
Here are some of the most prominent research papers regarding various HPE approaches.
Human Pose Estimation (HPE) is a way of extracting the pose of one or more humans, usually in the form of a skeleton, from a given input: an image or a video.
It's a fascinating and rapidly growing field of research that finds applications in a variety of industries. Some of the most common use cases for HPE include sports coaching, computer games, healthcare, and more.
If you'd like to get some hands-on experience with Human Pose Estimation, consider annotating your data with tools like V7 that offer a keypoint skeleton annotation feature you can try out for free.