A Comprehensive Guide to Human Pose Estimation

Here's everything you need to know about Human Pose Estimation and its real-world applications. Learn about different HPE approaches and check out the most prominent research papers in this field.

Pose estimation is a computer vision task that is usually tackled through deep learning.

It is one of the most interesting areas of research that has gained a lot of traction because of its usefulness and versatility—it finds applications in a wide range of fields including gaming, healthcare, AR, and sports.

This article will give you a comprehensive overview of what human pose estimation is and how it works. We promise to keep things simple 😉

It will also cover different HPE approaches—both classical and deep learning-based methods, metrics and evaluation techniques, and much more.

After reading this article, you'll learn:

  1. What is Human Pose Estimation?
  2. Classical vs Deep Learning-based approaches
  3. Human Pose Estimation using Deep Neural Networks
  4. Evaluation metrics for the Human Pose Estimation model
  5. Top 10 Research Papers on Human Pose Estimation
  6. 6 Human Pose Estimation applications


Now, let's dive in.

What is Human Pose Estimation?

Human Pose Estimation (HPE) is a way of identifying and classifying the joints in the human body.

Essentially, it is a way to capture a set of coordinates for each joint (arm, head, torso, etc.). Each of these is known as a key point, and together they describe the pose of a person. The connection between two such points is known as a pair.

The connection formed between the points has to be significant, which means not all points can form a pair. From the outset, the aim of HPE is to form a skeleton-like representation of a human body and then process it further for task-specific applications.
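To make the idea of key points and pairs concrete, here is a minimal sketch of a skeleton-like representation. The joint names, coordinates, and the `skeleton_segments` helper are made up for illustration; real datasets define their own joint lists.

```python
from typing import Dict, List, Tuple

# Hypothetical key points: joint name -> (x, y) pixel coordinates for one person.
keypoints: Dict[str, Tuple[float, float]] = {
    "head": (120.0, 40.0),
    "neck": (120.0, 80.0),
    "right_shoulder": (95.0, 85.0),
    "right_elbow": (85.0, 130.0),
    "left_shoulder": (145.0, 85.0),
    "left_elbow": (155.0, 130.0),
}

# Only meaningful connections are listed as pairs (no head-to-elbow link, for example).
pairs: List[Tuple[str, str]] = [
    ("head", "neck"),
    ("neck", "right_shoulder"),
    ("right_shoulder", "right_elbow"),
    ("neck", "left_shoulder"),
    ("left_shoulder", "left_elbow"),
]


def skeleton_segments(kps, prs):
    """Return the (start, end) line segments that make up the skeleton."""
    return [(kps[a], kps[b]) for a, b in prs if a in kps and b in kps]


segments = skeleton_segments(keypoints, pairs)
```

The resulting list of segments is what a visualizer would draw on top of the image, and what downstream applications would process further.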


There are three types of approaches to model the human body:

  1. Skeleton-based model
  2. Contour-based model
  3. Volume-based model
Human body models

Classical vs. Deep Learning-based approaches

HPE is primarily a computer vision problem. It is used to understand the geometric and motion information of the human body, which can be very intricate.

This section explores the two approaches: the classical approach and the deep learning-based approach to HPE. We will also explain how classical approaches fail to capture the geometric and motion information of the human body, and how deep learning algorithms such as the CNNs excel at it.

Classical approaches to 2D Human Pose Estimation

Classical approaches usually refer to techniques and methods involving shallow machine learning algorithms.

For instance, the earlier work to estimate human pose included the implementation of random forest within a “pictorial structure framework”.  This was used to predict joints in the human body.

The pictorial structure framework (PSF) is commonly referred to as one of the traditional methods to estimate human pose. PSF contained two components:

  1. Discriminator: It models the likelihood of a certain part present at a particular location. In other words, it identifies the body parts.
  2. Prior: It models the probability distribution over poses using the output from the discriminator, so that the modeled pose is realistic.

In essence, the PSF objective is to represent the human body as a collection of coordinates for each body part in a given input image. PSF uses nonlinear joint regressors, ideally a two-layered random forest regressor.
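The interplay between the discriminator and the prior can be illustrated with a toy example. This is only a sketch under simplifying assumptions: the part likelihoods are hard-coded instead of coming from a random forest, the prior is a Gaussian on the offset between connected parts, and the best pose is found by brute force. All names and numbers are hypothetical.

```python
import itertools

import numpy as np


def pose_score(locations, unary, edges, expected_offsets, sigma=10.0):
    """Log-score of one configuration: part evidence plus a Gaussian prior
    on the offset between connected parts."""
    score = sum(np.log(unary[part][loc]) for part, loc in locations.items())
    for a, b in edges:
        offset = np.subtract(locations[b], locations[a])
        deviation = offset - np.asarray(expected_offsets[(a, b)])
        score += -np.sum(deviation ** 2) / (2 * sigma ** 2)
    return float(score)


def best_pose(unary, edges, expected_offsets):
    """Try every combination of candidate locations and keep the best one."""
    parts = list(unary)
    candidates = [list(unary[part]) for part in parts]
    best = max(
        itertools.product(*candidates),
        key=lambda combo: pose_score(
            dict(zip(parts, combo)), unary, edges, expected_offsets
        ),
    )
    return dict(zip(parts, best))


# Hypothetical discriminator output: likelihood of each part at candidate (x, y) locations.
unary = {
    "head": {(50, 20): 0.9, (200, 20): 0.6},
    "torso": {(50, 70): 0.7, (200, 150): 0.8},
}
edges = [("head", "torso")]
# Assumed prior: the torso sits roughly 50 px below the head.
expected_offsets = {("head", "torso"): (0, 50)}

pose = best_pose(unary, edges, expected_offsets)
```

In this toy setup the torso candidate with the highest individual likelihood is rejected, because the prior considers its position relative to the head unrealistic.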

Pictorial Structure Framework

These models work well when the input image has clear and visible limbs. However, they fail to capture and model limbs that are hidden or not visible from a certain angle.

To overcome these issues, feature-building methods such as histogram of oriented gradients (HOG), contours, and histograms were used. Even with these methods, the classical models lacked accuracy, correlation, and generalization capabilities, so adopting a better approach was just a matter of time.

Deep Learning-based approaches to 2D Human Pose Estimation

Deep learning-based approaches are characterized by their ability to approximate almost any function (if a sufficient number of nodes are present in the hidden layer).

When it comes to computer vision tasks, deep convolutional neural networks (CNNs) generally outperform classical algorithms, and this is true in HPE as well.

CNNs can extract patterns and representations from the given input image with more precision and accuracy than handcrafted pipelines, which makes them very useful for tasks such as classification, detection, and segmentation.

Unlike the classical approach, where the features were handcrafted, a CNN can learn complex features when provided with enough training data.


Toshev et al. introduced CNNs to human pose estimation in 2014, switching from the classical approach to the deep learning-based approach, in a paper titled DeepPose: Human Pose Estimation via Deep Neural Networks.

In the paper, they framed the whole problem as a CNN-based regression towards body joints.

The authors also proposed an additional method where they implemented the cascade of such regressors in order to get more precise and consistent results. They argued that the proposed Deep Neural Network can model the given data in a holistic fashion, i.e. the network has the capability to model hidden poses, which was not true for the classical approach.
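Here is a rough sketch of what "regression towards body joints" means in practice. It assumes that joint coordinates are normalized relative to the person's bounding box and that the network is trained with a squared-error loss. The function names, the box format, and the numbers are illustrative, and the network itself is left out.

```python
import numpy as np


def normalize_joints(joints, box):
    """Map absolute (x, y) joints into box-relative coordinates.

    box = (cx, cy, w, h): center and size of the person bounding box.
    """
    cx, cy, w, h = box
    joints = np.asarray(joints, dtype=float)
    return (joints - np.array([cx, cy])) / np.array([w, h])


def l2_regression_loss(pred_norm, true_norm):
    """Sum of squared differences between predicted and true normalized joints."""
    diff = np.asarray(pred_norm, dtype=float) - np.asarray(true_norm, dtype=float)
    return float(np.sum(diff ** 2))


box = (100.0, 100.0, 50.0, 200.0)
true_joints = [[100.0, 20.0], [110.0, 100.0]]
pred_joints = [[105.0, 20.0], [110.0, 120.0]]  # stand-in for a network output

loss = l2_regression_loss(
    normalize_joints(pred_joints, box), normalize_joints(true_joints, box)
)
```

Normalizing by the box makes the loss independent of where the person appears in the image and how large they are. A cascade would repeat the same regression on crops around the current joint estimates.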

With strong and promising results shown by DeepPose, the HPE research naturally gravitated towards the deep learning-based approaches.

Human Pose Estimation using Deep Neural Networks

As the research and development started to take off in HPE, it brought forth new challenges.

One of them was to tackle the multi-person pose estimation.

DNNs are very proficient at estimating the pose of a single person, but they struggle with multiple people because:

  1. An image can contain any number of people in different positions.
  2. As the number of people increases, the interactions between them increase as well, which adds computational complexity.
  3. Higher computational complexity often leads to longer inference times, which is a problem for real-time applications.

In order to tackle these problems, the researchers introduced two approaches:

  1. Top-down: First localize each person in the image or video, then estimate the body parts and the pose of each person separately.
  2. Bottom-up: First detect all human body parts in the image, then group them into individual poses.
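The difference between the two approaches is mostly a difference in the order of the steps, which the following sketch tries to show. The detectors are passed in as plain functions and the toy stand-ins at the bottom are purely hypothetical; in a real system each of them would be a neural network.

```python
from typing import Callable, List


def top_down(image, detect_people: Callable, estimate_single_pose: Callable) -> List:
    """Detect each person first, then run a single-person estimator per box."""
    return [estimate_single_pose(image, box) for box in detect_people(image)]


def bottom_up(image, detect_parts: Callable, group_parts: Callable) -> List:
    """Detect all body parts in the image first, then group them into people."""
    return group_parts(detect_parts(image))


# Toy stand-ins so the sketch runs end to end.
fake_image = "image"


def fake_person_detector(img):
    return [(0, 0, 10, 20), (30, 0, 10, 20)]  # two person boxes


def fake_single_pose(img, box):
    return {"box": box, "keypoints": []}


def fake_part_detector(img):
    # (part name, x, y, person id); a real grouping step would not be given the id.
    return [("head", 5, 2, 0), ("neck", 5, 6, 0), ("head", 35, 2, 1)]


def fake_grouping(parts):
    ids = sorted({p[3] for p in parts})
    return [[p for p in parts if p[3] == pid] for pid in ids]


poses_td = top_down(fake_image, fake_person_detector, fake_single_pose)
poses_bu = bottom_up(fake_image, fake_part_detector, fake_grouping)
```

Note that the top-down version calls the single-person estimator once per detected person, so its cost grows with the number of people, while the bottom-up version runs part detection once and pays for the grouping step instead.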

Now, let's have a look at deep learning models that are used for multi-human pose estimation.



OpenPose

OpenPose was proposed by Zhe Cao et al. in 2019.

It is a bottom-up approach where the network first detects the body parts or key points in the image, followed by mapping appropriate key points to form pairs.

OpenPose also uses CNN as its main architecture. It consists of a VGG-19 convolutional network that is used to extract patterns and representations from the given input. The output from the VGG-19 goes into two branches of convolutional networks.

The first branch predicts a set of confidence maps, one for each body part, while the second branch predicts Part Affinity Fields (PAFs), which encode the degree of association between parts. The PAFs are also useful for pruning the weaker links in the bipartite graphs.
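To give a feel for how a Part Affinity Field measures the association between two candidate key points, here is a simplified sketch. It assumes the PAF for one limb type is stored as two 2D arrays (x and y components) and scores a candidate pair by averaging the alignment between the field and the direction of the segment that joins them. The `paf_score` function, the nearest-pixel sampling, and the toy field are illustrative simplifications, not the OpenPose implementation.

```python
import numpy as np


def paf_score(paf_x, paf_y, a, b, num_samples=10):
    """Average alignment between a PAF and the segment from a to b.

    paf_x, paf_y: 2D arrays with the x and y components of the field.
    a, b: candidate key points as (x, y) pixel coordinates.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    direction = b - a
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    # Sample evenly spaced points on the segment and read the field there.
    ts = np.linspace(0.0, 1.0, num_samples)
    points = a[None, :] + ts[:, None] * direction[None, :]
    xs = np.clip(np.round(points[:, 0]).astype(int), 0, paf_x.shape[1] - 1)
    ys = np.clip(np.round(points[:, 1]).astype(int), 0, paf_x.shape[0] - 1)
    dots = paf_x[ys, xs] * unit[0] + paf_y[ys, xs] * unit[1]
    return float(dots.mean())


# Toy field: a horizontal limb along row 5, pointing in the +x direction.
paf_x = np.zeros((10, 10))
paf_y = np.zeros((10, 10))
paf_x[5, 2:8] = 1.0

good = paf_score(paf_x, paf_y, (2, 5), (7, 5))  # follows the limb
bad = paf_score(paf_x, paf_y, (2, 5), (7, 1))   # leaves the limb
```

Candidate pairs with a low score are the weaker links that get pruned before the remaining key points are matched into skeletons.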

OpenPose Architecture

The image above shows the architecture of OpenPose, which is a multi-stage CNN.  

Essentially the predictions from the two branches, along with the features, are concatenated for the next stage to form a human skeleton depending upon the number of humans present in the input. Successive stages of CNNs are used to refine the prediction.

The image above describes the overall pipeline of OpenPose.

AlphaPose (RMPE)

Regional Multi-person Pose Estimation (RMPE) or AlphaPose implements a top-down approach to HPE.

The top-down approach depends heavily on the person detector: errors in localization lead to inaccurate pose predictions, which makes it quite challenging.

For instance, the image above shows two bounding boxes: the red box represents the ground truth, while the yellow box represents the predicted bounding box.

When it comes to classification, the yellow bounding box would be considered a “correct” bounding box for a human. However, the human pose cannot be estimated reliably even with this “correct” bounding box.

The authors of AlphaPose tackled this issue of imperfect human detection with a two-step framework. In this framework, they introduced two networks:

  1. Symmetric Spatial Transformer Network (SSTN): It helps to crop out the appropriate region in the input, which subsequently simplifies the classification task leading to better performance.
  2. Single Person Pose Estimator (SPPE): It is used to extract and estimate human pose.

The objective of AlphaPose is to extract a high-quality single-person region from an inaccurate bounding box by attaching SSTN to the SPPE. This method increases classification performances by tackling invariance while providing a stable framework to estimate human pose.



DeepCut

DeepCut was proposed by Leonid Pishchulin et al. in 2016 with the objective of jointly solving the tasks of detection and pose estimation.

It is a bottom-up approach to estimate human pose.

The idea was to detect all possible body parts in the given image, label them (head, hands, legs, etc.), and then separate the body parts belonging to each person.

The network uses Integer Linear Programming (ILP) modeling to implicitly group all the detected key points in the given input such that the resulting output resembles a skeleton representation of the human.

Mask R-CNN

Mask R-CNN is a very popular algorithm for instance segmentation.

The model has the capability to simultaneously localize and classify objects by creating a bounding box around the object and also by creating a segmentation mask.

Mask R-CNN

The basic architecture can be easily extended for Human Pose Estimation tasks.

Mask R-CNN builds on Faster R-CNN and uses a CNN to extract features and representations from the given input.

The extracted features are then used to propose where the object might be present through a Region Proposal Network (RPN).

Since the proposed regions can be of various sizes, like in the image above, a layer called RoIAlign is used to resample the extracted features so that they all have the same size.
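The following sketch shows the core idea behind RoIAlign: regions of different sizes are resampled with bilinear interpolation into a grid of fixed size. It is heavily simplified. It works on a single-channel feature map, takes one sample at the center of each bin, and ignores pixel-center offset conventions. The function names and the toy feature map are assumptions made for illustration.

```python
import numpy as np


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x)."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0, x0] * (1 - dx) + feature[y0, x1] * dx
    bottom = feature[y1, x0] * (1 - dx) + feature[y1, x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, box, out_size=2):
    """Pool a box (x1, y1, x2, y2) into an out_size x out_size grid.

    Simplification: one bilinear sample at the center of each bin.
    """
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            sample_y = y1 + (i + 0.5) * bin_h
            sample_x = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear(feature, sample_y, sample_x)
    return out


# Toy feature map where every value equals its x index.
feature = np.tile(np.arange(8, dtype=float), (8, 1))

small = roi_align(feature, (0.0, 0.0, 3.0, 3.0))
large = roi_align(feature, (1.0, 1.0, 7.0, 6.0))
```

Both regions end up as 2 x 2 outputs even though the boxes have different sizes, which is what allows the later branches of the network to process them uniformly.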

The extracted features are passed into the parallel branches of the network to refine the proposed region of interest (RoI) to generate bounding boxes and the segmentation masks.

Mask R-CNN Architecture

When it comes to human pose estimation, the mask segmentation output yielded by the network can be used to detect humans in the given input. Because mask segmentation is very precise in object detection, in this case - human detection, the human pose can be estimated quite easily.

This method resembles the top-down approach, where the person detection stage is performed in parallel to the part detection stage.

In other words, the keypoint detection stage and person detection stage are independent of each other.

6 Human Pose Estimation applications

Human pose estimation has a variety of real-life applications so let's have a look now at some of the most common HPE use cases.

AI-powered personal trainers

Maintaining physical well-being has become an integral part of our life these days and having a good trainer can help reach our desired fitness level.

It's no surprise that the market has become saturated with apps that harness the power of AI to help people work out better.

For instance, Zenia is an AI-powered yoga app that uses HPE to guide you towards achieving a proper posture during your yoga workouts. It uses the camera to detect your pose and estimates how accurate your pose is—if it is correct, then the predicted pose will be represented in green, just like in the image above. If the pose isn't correct, the red color will replace the green one.

Apart from yoga, HPE has also found application in other forms of exercise.

For example, it is now commonly used in weight lifting, where it can guide app users to perform a proper weight-lift by searching for common mistakes and providing insights on how to fix them to prevent injuries.


Robotics

Robotics has been one of the fastest-growing areas of development.

While programming a robot to follow a procedure can be tedious and time-consuming, deep learning approaches can come to the rescue.

Techniques such as reinforcement learning use a simulated environment to achieve the accuracy level required to perform a certain task and can be successfully used to train a robot.

Motion capture and augmented reality

Another interesting application of HPE can be CGI.

The entertainment sector, specifically, the cinema business, spends tons of cash to create computer-generated graphics for special effects, mysterious creatures, out-of-the-world sceneries, and a lot more.

CGI is expensive because it requires a lot of effort—like wearing special suits and masks to capture the motions, creating superficial effects in the estimated pose, processing power, and also large time investments on top of that.

HPE can automatically extract key points from 2D input and create a 3D rendering of the same, which can then be used to add effects, animations, and more.

Athlete pose detection

These days, almost all sports rely heavily on data analysis.

Pose detection can help players to improve their technique and achieve better results. Apart from that, pose detection can be used to analyze and learn about the strengths and weaknesses of the opponent, which is invaluable for professional athletes and their trainers.

Motion tracking for gaming

Another interesting application of pose estimation comes down to in-game applications, where players can make use of the motion capturing capabilities of HPE to inject poses into the gaming environment. The goal is to create an interactive gaming experience.

For example, Microsoft’s Kinect uses 3D pose estimation (using IR sensor data) to track the motion of the players and to use it to render the actions of the characters virtually into the gaming environment.

Infant Motion Analysis

HPE can also be used for the analysis of infant motion. This is very helpful for analyzing the behavior of the baby as it grows, especially in assessing the course of its physical development.

In some situations, infants are born with serious health issues related to muscles, joints, and nervous system, some of which are caused by cerebral palsy, movement disorders, or traumatic injuries.

Motion analysis can help identify which muscles or joints are not working properly. Pose estimation can pick out subtle anomalies in the movement of the infant, which the doctors can analyze and come up with a suitable treatment. HPE can also be used as a recommendation tool to improve physical abilities so that the child can have the greatest level of independence.

Evaluation metrics for Human Pose Estimation model

Deep learning models need proper evaluation metrics so that we can measure how well they learn during training and how well they perform at inference time. The right metric depends on the task at hand.

In this section, we will briefly discuss the four evaluation metrics required for HPE.

Percentage of Correct Parts (PCP)

PCP is used to measure the correct detection of limbs. A limb is considered detected if the distance between each of its two predicted joint locations and the corresponding true joint location is at most half of the limb length. However, it penalizes shorter limbs, for example the lower arm, because their allowed error is smaller.
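As a rough illustration, PCP could be computed along these lines. The `pcp` function, the joint indices, and the coordinates are made up for the example, and the half-limb-length rule is applied to both endpoints of each limb.

```python
import numpy as np


def pcp(pred, true, limbs, alpha=0.5):
    """Fraction of limbs whose two predicted endpoints both lie within
    alpha * (true limb length) of the ground-truth endpoints."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    correct = 0
    for a, b in limbs:
        limb_length = np.linalg.norm(true[a] - true[b])
        error_a = np.linalg.norm(pred[a] - true[a])
        error_b = np.linalg.norm(pred[b] - true[b])
        if error_a <= alpha * limb_length and error_b <= alpha * limb_length:
            correct += 1
    return correct / len(limbs)


# Joints 0-1 form a long limb (100 px), joints 1-2 a short limb (20 px).
true = [[0, 0], [0, 100], [0, 120]]
# Every predicted joint is off by the same 12 px.
pred = [[0, 12], [0, 112], [0, 132]]

score = pcp(pred, true, limbs=[(0, 1), (1, 2)])
```

With identical 12 px errors, the long limb counts as detected while the short limb does not, which is exactly the bias against shorter limbs mentioned above.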

Percentage of Detected Joints (PDJ)

In order to fix the issue raised by PCP, a new metric was proposed, called the percentage of detected joints (PDJ). A joint is considered detected if the distance between the predicted and the true joint is within a certain fraction of the torso diameter.

PDJ helps to achieve localization precision, which alleviates the drawback of PCP since the detection criteria for all joints are based on the same distance threshold.
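A minimal sketch of PDJ, assuming the torso diameter is already known and passed in as a number. The `pdj` function name, the default fraction of 0.2, and the coordinates are illustrative.

```python
import numpy as np


def pdj(pred, true, torso_diameter, fraction=0.2):
    """Fraction of joints predicted within fraction * torso_diameter
    of their ground-truth location."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    distances = np.linalg.norm(pred - true, axis=1)
    return float(np.mean(distances < fraction * torso_diameter))


true = [[0, 0], [0, 100], [0, 120]]
pred = [[0, 12], [0, 112], [0, 145]]  # errors of 12, 12, and 25 px

score = pdj(pred, true, torso_diameter=100.0)  # threshold is 20 px
```

Because every joint is compared against the same threshold, a 12 px error is treated the same way whether the joint belongs to a long or a short limb.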

Percentage of Correct Key-points (PCK)

PCK is an accuracy metric that measures whether the predicted keypoint and the true joint are within a certain distance threshold. The threshold is usually set with respect to the scale of the subject, which is enclosed within the bounding box.

Common choices for the threshold are:

  • PCKh@0.5: the threshold is 50% of the head bone link
  • PCK@0.2: the distance between the predicted and true joint must be less than 0.2 * torso diameter
  • A fixed distance, sometimes 150 mm

Because the threshold scales with the subject, PCK alleviates the shorter limb problem: smaller subjects have shorter limbs but also smaller torsos and head bone links. PCK is used for both 2D and 3D (PCK3D) pose estimation.
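The sketch below shows one way PCKh could be computed over several people at once, with a separate threshold per person based on their head bone link. The `pckh` function, the array layout (people, joints, 2), and the numbers are assumptions for the example.

```python
import numpy as np


def pckh(pred, true, head_lengths, alpha=0.5):
    """Fraction of key points predicted within alpha * head length,
    using a separate head length (threshold) for each person."""
    pred = np.asarray(pred, dtype=float)   # shape: (people, joints, 2)
    true = np.asarray(true, dtype=float)   # shape: (people, joints, 2)
    thresholds = alpha * np.asarray(head_lengths, dtype=float)[:, None]
    distances = np.linalg.norm(pred - true, axis=-1)  # shape: (people, joints)
    return float(np.mean(distances < thresholds))


true = [[[0, 0], [50, 0]], [[0, 0], [20, 0]]]
pred = [[[10, 0], [80, 0]], [[12, 0], [25, 0]]]  # errors: 10, 30 and 12, 5 px
head_lengths = [40.0, 20.0]                      # thresholds: 20 and 10 px

score = pckh(pred, true, head_lengths)
```

A 12 px error is rejected for the smaller person while a 10 px error is accepted for the larger one, because each person is judged relative to their own scale.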

Object Keypoint Similarity (OKS) based mAP

OKS is commonly used in the COCO keypoint challenge as an evaluation metric. It is defined as:

$$\mathrm{OKS} = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

where:

  • d_i is the Euclidean distance between the ground truth and predicted keypoint i
  • s is the object scale, the square root of the object segment area
  • k_i is the per-keypoint constant that controls falloff
  • v_i is a visibility flag that can be 0, 1, or 2 for not labeled, labeled but not visible, and labeled and visible, respectively

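OKS takes values between 0 and 1 and shows how close the predicted keypoints are to the true keypoints: the closer the score is to 1, the better the match. Mean average precision (mAP) is then computed by thresholding OKS, in the same way IoU is thresholded in object detection.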
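A minimal sketch of OKS for a single person is shown below, following the COCO-style definition with d_i, s, k_i, and v_i. The `oks` function and the per-keypoint constants are placeholders; COCO publishes its own constants for each keypoint type.

```python
import numpy as np


def oks(pred, true, visibility, area, k):
    """Object Keypoint Similarity for one person.

    pred, true: (N, 2) keypoint coordinates
    visibility: (N,) flags, 0 = not labeled, 1 = labeled but not visible, 2 = visible
    area: object segment area, so that s**2 == area
    k: (N,) per-keypoint falloff constants (hypothetical values here)
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    k = np.asarray(k, dtype=float)
    labeled = np.asarray(visibility) > 0
    if not labeled.any():
        return 0.0
    squared_distance = np.sum((pred - true) ** 2, axis=1)
    similarity = np.exp(-squared_distance / (2 * area * k ** 2))
    # Only labeled keypoints contribute to the average.
    return float(similarity[labeled].mean())


true = np.array([[10.0, 10.0], [20.0, 10.0], [30.0, 10.0]])
visibility = np.array([2, 1, 0])
k = np.array([0.1, 0.1, 0.1])  # placeholder constants
area = 400.0

perfect = oks(true, true, visibility, area, k)
shifted_pred = true + np.array([[3.0, 0.0], [0.0, 0.0], [500.0, 0.0]])
shifted = oks(shifted_pred, true, visibility, area, k)
```

The large error on the third keypoint has no effect on the score because that keypoint is not labeled. Only the 3 px shift on the first keypoint lowers the similarity.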

Top 10 Research Papers on Human Pose Estimation

Here are some of the most prominent research papers regarding various HPE approaches.

  1. DeepPose: Human Pose Estimation via Deep Neural Networks
  2. Convolutional Pose Machines
  3. RMPE: Regional Multi-Person Pose Estimation
  4. Efficient Object Localization Using Convolutional Networks
  5. DeepCut: Joint Subset Partition and Labeling for Multi-Person Pose Estimation
  6. Simple Baselines for Human Pose Estimation and Tracking
  7. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
  8. Human Pose Estimation for Real-World Crowded Scenarios
  9. DensePose: Dense Human Pose Estimation In The Wild
  10. PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model

Human Pose Estimation in a Nutshell

Human Pose Estimation (HPE) is a way of extracting the pose of one or more humans, usually in the form of a skeleton, from a given input: an image or a video.

It's a fascinating and rapidly growing field of research that finds applications in a variety of industries. Some of the most common use cases for HPE include sports coaching, computer games, healthcare, and more.

If you'd like to get some hands-on experience with Human Pose Estimation, consider annotating your data with tools like V7, which offer a keypoint skeleton annotation feature you can try out for free.


Nilesh Barla

Nilesh Barla is the founder of PerceptronAI, which aims to provide solutions in medical and material science through deep learning algorithms. He studied metallurgical and materials engineering at the National Institute of Technology Trichy, India, and enjoys researching new trends and algorithms in deep learning.
