Human Activity Recognition (HAR): Fundamentals, Models, Datasets

20 min read

Mar 27, 2023

Dive into the state-of-the-art of Human Activity Recognition (HAR) and discover real-life applications plus datasets to try out.

Deval Shah

Human Activity Recognition (HAR) is an exciting research area in computer vision and human-computer interaction.

Automatic detection of human physical activity has become crucial in pervasive computing, interpersonal communication, and human behavior analysis.

The broad usage of HAR benefits human safety and general well-being. Health monitoring can be done through wearable devices tracking physical activity, heart rate, and sleep quality. In smart homes, HAR-based solutions allow for energy saving and personal comfort by detecting when a person enters or leaves a room and adjusting the lighting or temperature. Personal safety devices can automatically alert emergency services or a designated contact. And that’s just the tip of the iceberg.

With multiple publicly available datasets, finding ready-to-use data for study and development purposes is very simple.

In this post, you’ll learn more about HAR’s current state-of-the-art, along with deep learning methods and machine learning models best suited for the task.

Here’s what we’ll cover:

  • What is Human Activity Recognition?

  • How does HAR work?

  • HAR models

  • Human activity recognition applications

  • Human activity recognition datasets

And if you're ready to jump straight into labeling data and training your AI models, make sure to check out:

  1. V7 Annotation

  2. V7 Model Training

  3. V7 Dataset Management

What is Human Activity Recognition (HAR)?

Human Activity Recognition (HAR) is a branch of computational science and engineering that tries to create systems and techniques capable of automatically recognizing and categorizing human actions based on sensor data. It is the capacity to use sensors to interpret human body gestures or motion and determine human activity or movement.

HAR systems are typically supervised or unsupervised and can be used in various applications, including wellness, healthcare, security, and sports performance.

From a modeling perspective, a HAR system's objective is to predict the label of a person's action from an image or video, which is commonly done through image-based or video-based activity recognition.

Read more: Image Recognition: Definition, Algorithms & Uses

Pose estimation underpins some of the most common vision-based HAR systems. Researchers use it more and more frequently because it reveals essential information about human behavior.

Pro tip: Check our guide to Human Pose Estimation

This helps in tasks such as HAR, content extraction, and semantic comprehension. It makes use of various deep learning (DL) approaches, especially convolutional neural networks.

One of HAR’s biggest challenges is taking the physical attributes of humans, cultural markers, direction, and the type of poses into consideration. For example, let’s take a look at the image below. It may be hard to predict whether the person is falling or attempting a handstand. This uncertainty encourages the use of newer methods within the artificial intelligence framework.

Multi-modal learning and graph-based learning aim to improve the accuracy and robustness of HAR systems by incorporating more complex features, utilizing multiple data sources, and capturing the spatial and temporal relationships between body parts.

Some of the other HAR challenges include:

  • disparity in sensor data due to device placement

  • movement variation

  • interference of activities that overlap

  • noisy data that causes distortions

  • time-consuming and expensive data collection methods

How does Human Activity Recognition work?

Human Activity Recognition framework

One of the critical objects of study in computer vision and machine learning is the human ability to perceive the activities of others. Here are the basic steps involved in a typical HAR pipeline.

1. Data collection

The data for HAR is usually acquired by sensors attached to or worn by the user. Standard HAR sensors include accelerometers, gyroscopes, magnetometers, and GPS sensors.

Accelerometers measure acceleration along three axes (x, y, and z), which makes it possible to detect changes in movement and direction. Gyroscopes measure rotation and angular velocity, whereas magnetometers sense magnetic fields and orientation. GPS sensors can help track a user's whereabouts and movements, although they are less commonly used for HAR because of their substantial power consumption and limited indoor precision. Sensor data is usually captured as a time series, with each sample reflecting the sensor measurements at a specific point in time (e.g., every second).
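To make the time-series idea concrete, here is a minimal Python sketch of how accelerometer samples might be represented. The `ImuSample` class, the 50 Hz sampling rate, and the synthetic stream are assumptions made for illustration; they don't correspond to any particular device's API.

```python
import math
from dataclasses import dataclass


@dataclass
class ImuSample:
    """One reading from a 3-axis accelerometer (hypothetical layout)."""
    t: float   # timestamp in seconds
    ax: float  # acceleration along x, in m/s^2
    ay: float  # acceleration along y, in m/s^2
    az: float  # acceleration along z, in m/s^2


def magnitude(sample: ImuSample) -> float:
    """Acceleration magnitude, which does not depend on device orientation."""
    return math.sqrt(sample.ax ** 2 + sample.ay ** 2 + sample.az ** 2)


SAMPLING_RATE_HZ = 50  # assumed sampling rate

# Synthetic stream: a device lying flat and still, so only gravity shows up on z
stream = [
    ImuSample(t=i / SAMPLING_RATE_HZ, ax=0.0, ay=0.0, az=9.81)
    for i in range(100)
]
magnitudes = [magnitude(s) for s in stream]
```

Working with the magnitude rather than the raw axes is one simple way to reduce sensitivity to how the device is oriented, at the cost of discarding directional information.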

2. Data pre-processing

Data preprocessing is an essential stage in Human Activity Recognition (HAR) since it cleans, transforms, and prepares raw sensor data for subsequent analysis and modeling. Some standard preprocessing steps include:

  1. Filtering: Filtering is a signal processing technique for removing noise and unwanted components from raw sensor data. Depending on the frequency range of the signals of interest, typical filters used in HAR include low-pass, high-pass, and band-pass filters.

  2. Feature extraction: The features used are determined by the type of activity and the sensor modality. Accelerometer data, for example, can be used to extract time-domain features such as mean and standard deviation, and frequency-domain features such as Fourier and wavelet transform coefficients.

  3. Feature selection: Feature selection reduces the dimensionality of the feature space and increases the accuracy and efficiency of activity recognition algorithms. This entails choosing the most relevant features based on their discriminative power, correlation with the activity labels, and redundancy with other features.

  4. Segmentation: To capture the temporal aspects of activities, segmentation splits the sensor data into shorter segments or windows. The size and overlap of the windows are determined by the duration and intensity of the activity being observed. The features are then computed for each window.

  5. Normalization: Normalization is the process of scaling features to zero mean and unit variance so that they are comparable across sensors and participants.

  6. Dimensionality reduction: Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are dimensionality reduction techniques that can reduce the dimensionality of the feature space and remove redundant or irrelevant features.

  7. Missing value imputation: Imputation is about filling in incomplete sensor data. Gaps may appear due to device malfunction or data transmission faults. Simple approaches can be used for missing values, including mean or median imputation and interpolation.
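Filtering, segmentation, and feature extraction from the list above can be sketched in a few lines of plain Python. This is a simplified illustration, not a production pipeline: the moving average stands in for a proper low-pass filter, and the function names, the window size, and the synthetic signal are all assumptions.

```python
import statistics


def moving_average(signal, k=5):
    """Crude low-pass filter: average each sample with up to k-1 predecessors."""
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - k + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def sliding_windows(signal, size, overlap=0.5):
    """Split a signal into fixed-size windows with fractional overlap."""
    step = max(1, int(size * (1 - overlap)))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def window_features(window):
    """A few time-domain features for one window."""
    return {
        "mean": statistics.fmean(window),
        "std": statistics.pstdev(window),
        "min": min(window),
        "max": max(window),
    }


# Synthetic single-channel signal (illustrative only)
raw = [float(i % 10) for i in range(200)]
smooth = moving_average(raw, k=5)
windows = sliding_windows(smooth, size=50, overlap=0.5)
features = [window_features(w) for w in windows]
```

Each entry in `features` describes one window and can be fed to a classifier. Note that trailing samples that don't fill a complete window are dropped in this sketch.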

Data preparation is a crucial stage in HAR since it affects the accuracy and reliability of activity recognition models.
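Two of the simpler steps, missing value imputation and normalization, might look like the sketch below. The helper names and the sample readings are hypothetical, and missing values are assumed to arrive as `None` or NaN.

```python
import math
import statistics


def impute_mean(values):
    """Replace missing readings (None or NaN) with the mean of the observed ones.

    Assumes at least one reading is present.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    fill = statistics.fmean(observed)
    return [fill if is_missing(v) else v for v in values]


def zscore(values):
    """Scale values to zero mean and unit variance."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


readings = [1.0, None, 3.0, float("nan"), 5.0]
filled = impute_mean(readings)
scaled = zscore(filled)
```

In a real project, the mean and standard deviation would normally be computed on the training data only and then reused for validation and test data, so that no information leaks between the splits.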

3. Model selection

Several machine learning algorithms may be used to recognize human activities. The choice should depend on data complexity, available resources, and performance criteria. Here are some popular HAR machine learning models:

  1. Decision trees: Decision tree algorithms are straightforward models that deal with non-linear interactions among features and labels. They can be used for classification tasks in Human Activity Recognition based on sensor data such as accelerometers or gyroscope readings. Decision trees are easy to interpret and can handle both continuous and categorical data, making them useful for gaining insights into the most important features of a given classification task. However, they may suffer from overfitting and fall short in scenarios where the input data is highly complex or noisy.

  2. Random forest: Random forests are decision tree ensembles that can manage noisy and high-dimensional data. They resist overfitting and can deal with missing values. On the other hand, random forests may take more computational resources than decision trees and may not perform well on very small datasets.

  3. Support Vector Machines: SVMs are robust models that can handle both linearly and nonlinearly separable data. They can deal with high-dimensional data while being less susceptible to overfitting. However, they may need careful hyperparameter tuning and can be computationally costly with massive datasets.

  4. Hidden Markov Models: HMM is a statistical model used in HAR to recognize sequential patterns in sensor input. HMMs are very useful for time-series data and may be effective for complex activities with several steps.

  5. Convolutional Neural Networks (CNNs): CNNs are deep learning algorithms well-suited for image and time-series data, such as gyroscope and accelerometer data. These algorithms can efficiently learn hierarchical features from raw data and manage complex data patterns but may need more computation power than other models and are prone to overfitting.

  6. Recurrent Neural Networks (RNNs): RNNs are deep learning models that handle sequential data such as time series. They can deal with variable-length sequences and detect temporal connections in data. However, they may struggle with the vanishing gradient issue and require careful initialization and regularization.
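To give a feel for the simplest model on this list, here is a sketch of a decision stump, i.e., a decision tree of depth one, trained on a single feature. The feature (per-window standard deviation of acceleration), the values, and the two labels are made up for illustration; a real decision tree would consider many features and split recursively.

```python
def fit_stump(xs, ys):
    """Find the threshold on one feature that best separates two classes.

    xs: feature values, ys: labels (0 or 1).
    Assumes xs contains at least two distinct values.
    Returns (threshold, label_above).
    """
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for thr in candidates:
        for label_above in (0, 1):
            errors = sum(
                (label_above if x > thr else 1 - label_above) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, thr, label_above)
    _, thr, label_above = best
    return thr, label_above


def predict_stump(model, x):
    """Classify one feature value with a fitted stump."""
    thr, label_above = model
    return label_above if x > thr else 1 - label_above


# Hypothetical feature: per-window std of acceleration magnitude
stds = [0.02, 0.03, 0.05, 0.04, 0.90, 1.10, 0.80, 1.30]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = standing, 1 = walking (illustrative)
model = fit_stump(stds, labels)
```

The learned threshold is easy to inspect, which reflects the interpretability advantage mentioned above. The same simplicity is also why a single tree struggles once the data gets noisy or high-dimensional.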

4. Model deployment

Human Activity Recognition (HAR) systems are deployed using one of two methods:

  1. External sensing deployment: In this method, external sensors (including cameras or motion detectors) are placed in the surroundings to collect information on human activities. A HAR model running on a separate computing machine processes the sensor data. This method is excellent for monitoring actions in public places or when the person being tracked cannot wear a device.

  2. On-body sensing deployment: Here, the sensors (such as a wrist-worn accelerometer) are worn by the person being observed to capture information about human activities. A HAR model, running either locally on the device (for example, a smartwatch) or on a remote computing system, processes the sensor data. This method is effective for monitoring activity in private locations or when the person being monitored can wear a device.
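In either setup, the deployed model usually has to work on a continuous stream rather than on a prepared dataset. The sketch below shows one way that loop might look. `StreamingRecognizer` and `toy_classifier` are hypothetical names, and the classifier is a placeholder for a trained model.

```python
from collections import deque


class StreamingRecognizer:
    """Buffers incoming samples and classifies each full window."""

    def __init__(self, classify, window_size=100, step=50):
        self.classify = classify        # any callable: list of samples -> label
        self.window_size = window_size
        self.step = step                # samples between two classifications
        self.buffer = deque(maxlen=window_size)
        self.since_last = 0

    def push(self, sample):
        """Add one sample; return a label when a new window is ready, else None."""
        self.buffer.append(sample)
        self.since_last += 1
        if len(self.buffer) == self.window_size and self.since_last >= self.step:
            self.since_last = 0
            return self.classify(list(self.buffer))
        return None


def toy_classifier(window):
    """Placeholder model: a large spread in the window means movement."""
    spread = max(window) - min(window)
    return "moving" if spread > 1.0 else "still"


recognizer = StreamingRecognizer(toy_classifier, window_size=4, step=2)
stream = [9.8, 9.8, 9.8, 9.8, 12.0, 7.5, 9.8, 9.8]
labels = [recognizer.push(s) for s in stream]
```

A fixed-size buffer keeps memory use constant, which matters when the model runs on a wearable. The same class could just as well sit on a server that receives samples from external sensors.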


Pro tip: Check out our detailed guide to keypoint annotation

Deep neural network models for Human Activity Recognition

HAR is a complex subject for study in the discipline of computer vision. Researchers worldwide have been working on constructing a near-perfect recognition system for a long time.

For example, a paper by J. Gao et al. compares the performance of deep learning algorithms (such as Convolutional Neural Networks and Recurrent Neural Networks) to classical machine learning methods (such as Support Vector Machines and Random Forests) in HAR tasks.

The study finds that deep learning algorithms outperform classical machine learning methods in terms of accuracy, robustness to variations in sensor data, and the ability to learn complex features automatically from raw data. The study also notes that deep learning algorithms can be computationally efficient and implemented on low-power devices for real-time HAR applications.

DL models can accommodate variations in sensor placement, orientation, and other environmental conditions that alter sensor signals, making them more resilient to real-world circumstances. DL models are also scalable and capable of handling big datasets containing millions of observations, which is especially beneficial for HAR applications involving several sensors and multiple users.

Besides this, deep learning algorithms excel in processing time-series data to classify and extract features, leveraging local dependencies. Researchers are increasingly interested in using sophisticated deep learning approaches such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and hybrid models to recognize human activities better.

Furthermore, deep learning makes it possible to build end-to-end models that map sensor data directly to activity labels, reducing the need for intermediate steps like hand-crafted feature extraction.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a form of deep learning model that works well with sequential input, particularly in Human Activity Recognition situations where the input data is time-series data from sensors.

In RNN-based HAR, the input data is first turned into a sequence of fixed-length feature vectors, with each vector representing a time window of sensor data. The feature vector sequence is then passed into the RNN, which processes each input vector in turn while keeping a hidden state that retains the temporal connections between input vectors.

The ability of RNNs to capture temporal dependencies in input data is their primary benefit for HAR. This is achieved through recurrent connections between the RNN's hidden states. The recurrent connections let the RNN keep an internal memory of prior inputs, which helps it recognize complicated patterns of activity that span numerous time frames.

RNNs have demonstrated encouraging results in HAR, with high accuracy and robustness in recognizing complex activities, such as athletic movements, home activities, and falls. They can also handle variable-length input sequences, making them well suited to practical uses where activity duration varies. Their limitations include the vanishing and exploding gradient problems, which can impact the training process.
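The recurrence described in this section can be written down directly. Below is a minimal forward pass of a vanilla (Elman) RNN in plain Python, using the common update h_t = tanh(W_x x_t + W_h h_{t-1} + b). The weights and the input sequence are arbitrary illustrative values, and training is left out entirely.

```python
import math


def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence of feature vectors.

    inputs: list of T vectors, each of length D
    W_x: H x D input weights, W_h: H x H recurrent weights, b: length-H bias
    Returns the list of T hidden states.
    """
    H = len(b)
    h = [0.0] * H  # initial hidden state
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state
        h = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_h[i][k] * h[k] for k in range(H))
                + b[i]
            )
            for i in range(H)
        ]
        states.append(h)
    return states


# Illustrative values only: 2 input features, 2 hidden units, 3 time steps
W_x = [[0.5, -0.5], [0.1, 0.2]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = rnn_forward(sequence, W_x, W_h, b)
```

For classification, the final hidden state (`states[-1]`) would typically be passed to an output layer that scores each activity. Because the loop runs once per input vector, sequences of any length work without changes.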

Long Short-Term Memory

Long Short-Term Memory (LSTM) is a form of Recurrent Neural Network (RNN) which has been effectively used for a variety of sequential data-related tasks, including Human Activity Recognition (HAR).

LSTM models, like other RNNs, are designed to analyze data sequences and maintain an internal memory of prior inputs, enabling them to retain the temporal connections between different sections of the sequence.

The main benefit of LSTMs over standard RNNs is their capacity to selectively forget or retain information from previous time steps. This helps mitigate the vanishing gradient problem, which frequently occurs in regular RNNs. LSTMs can effectively model long-term dependencies inside the input sequence. They’re well-suited for complicated HAR tasks such as identifying anomalies and recognizing complex human actions.

LSTM-based models demonstrated significant gains in HAR tasks in various benchmark datasets, attaining state-of-the-art performance. They have also shown resilience in detecting complicated activities and dealing with variable-length input sequences. However, just like other models based on deep learning, LSTMs have several drawbacks for HAR: the requirement for vast volumes of labeled data, computational cost, and model interpretability.
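The gating idea is easier to see in code. The following sketch implements one step of a single-unit LSTM with the standard forget, input, and output gates. Scalar input and state are a deliberate simplification, and the parameter values are chosen by hand to show the "retain" behavior rather than learned from data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM (scalar input, hidden state, cell state).

    p maps each gate name to (w_x, w_h, bias).
    Returns the new hidden state and the new cell state.
    """
    def pre(gate):
        w_x, w_h, bias = p[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # the new information itself
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hand-picked parameters: forget gate wide open, input gate almost closed,
# so the cell keeps its previous content regardless of the input
keep_params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 20.0),
}
h, c = lstm_step(x=0.3, h_prev=0.0, c_prev=0.7, p=keep_params)
```

With these parameters the cell state stays very close to 0.7 however many steps are run, which is the mechanism that lets information and gradients survive over long sequences. In a trained network the gates are learned, so the model decides per time step what to keep and what to overwrite.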

RNN-LSTM basic outline

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a deep learning architecture that excels at processing image and video data. CNNs have been utilized in the setting of Human Activity Recognition (HAR) to automatically and reliably detect and classify human actions from sensor data.

The input data for HAR utilizing CNNs is often time-series data acquired by sensors. The time-series data is first transformed into a 2D image-like format, with time as the x-axis and sensor channels as the y-axis.

The generated data matrix is then input into the CNN for feature extraction and classification. Using a sliding window technique, the CNN's convolutional layers apply filters to the incoming data. At different points in the input data, each filter extracts a particular feature, such as edges or corners.

The result of the convolutional layers is then passed into the pooling layers, which downsample the extracted features while maintaining their crucial spatial correlations. The pooling layers' output is then flattened and passed into fully connected layers that classify the extracted features into distinct human activities. The output of the fully connected layers is then fed into a softmax function, which generates a probability distribution over the various activities.
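The convolution, pooling, fully connected, and softmax stages can be traced by hand on a tiny example. The sketch below uses a single sensor channel and one fixed filter for readability, whereas a real CNN stacks several channels and learns many filters during training. All values, including the two-class output layer, are illustrative.

```python
import math


def conv1d(signal, kernel):
    """Valid 1D cross-correlation, the operation used in CNN layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def max_pool(values, size=2):
    """Non-overlapping max pooling."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def dense(features, weights, biases):
    """Fully connected layer: one weight row and one bias per output."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
edge_kernel = [-1.0, 1.0]  # responds to rising edges in the signal
feature_map = relu(conv1d(signal, edge_kernel))
pooled = max_pool(feature_map, size=3)

# Hypothetical output layer with two activity classes
logits = dense(pooled, weights=[[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]], biases=[0.0, 0.0])
probs = softmax(logits)
```

The filter fires at the two positions where the signal jumps from 0 to 1, pooling keeps those responses while shortening the sequence, and the softmax turns the final scores into probabilities that sum to one.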

The image below, taken from this paper, gives us an idea of how CNN’s basic framework works.

CNN basic outline

CNNs have the advantage of handling input of different sizes and shapes, making them well suited to interpreting sensor data from various devices. Furthermore, CNNs can learn hierarchical feature representations of the input data, allowing them to capture both the low-level and high-level features essential to human activity recognition.

Pro tip: Looking for a source to recap activation functions? Check out Types of Neural Networks Activation Functions

Applications and uses of Human Activity Recognition

Human Activity Recognition is already used in multiple fields, with new applications appearing all the time. Let’s go through a few flagship examples.

Applications of Human Activity Recognition

Sports performance analysis

Human Activity Recognition (HAR) can analyze sports performance in various ways. It may be utilized to track and analyze athletes' movements during competition and training, anticipate new injury risks, assess the effectiveness of different training programs, follow individual athletes' growth, and examine team sports' tactical and strategic components.

For example, HAR can be used to analyze badminton players' movements as they hit and smash, track runners' movements to identify possible overuse injuries, monitor soccer players' performance during a game, track tennis players' movements throughout a match to identify areas for improved footwork and positioning, or analyze basketball players' actions during a game to find opportunities to improve team defense and ball movement.

Keypoint annotations in the V7 tool

Pro tip: Check out 7 Game-Changing AI Applications in the Sports Industry

Self-driving cars

Human Activity Recognition (HAR) has numerous uses in self-driving cars. HAR may be employed to detect people and other vehicles on the road, increasing the effectiveness and safety of self-driving automobiles.

HAR, for example, may be utilized to identify and monitor the motions of pedestrians, cyclists, and other vehicles in the environment, allowing self-driving cars to predict and prevent collisions.

HAR can also recognize driver behavior, such as hand signals and head movements, which can help self-driving cars communicate with human drivers.

Pro tip: Check out 9 Revolutionary AI Applications In Transportation

Human/computer interaction

Human Activity Recognition can be used to identify and classify human gestures and movements, which can be utilized to improve computer system usability and accessibility.

HAR can be used to enable gesture-based commands of electronic devices like smartphones and smart TVs, resulting in an even more natural and easily understood user interface. HAR can also provide voice-based automation of computer systems, such as virtual personal assistants and chatbots, allowing for more practical and effective communication with computers.

Furthermore, HAR can monitor computer users' health and wellness by identifying and categorizing their physical movements and behaviors, which can help prevent and reduce the harmful impacts of prolonged computer use, including eye strain, back pain, etc.

Gaming

Human Activity Recognition has several uses in the gaming industry. HAR can help recognize and classify various player actions and gestures, allowing for more immersive and participatory gaming experiences.

For instance, HAR may enable motion-controlled gaming, translating the player's movements and gestures into in-game activities such as swinging a sword or throwing a ball. HAR can also provide gesture-based manipulation of in-game panels and settings, making navigating the game more convenient and intuitive.

Furthermore, HAR can track a player's physical exercise and motions while playing. A game, for example, may reward the player for completing a certain number of steps or performing a particular workout.

Smart surveillance

As it permits automatic video analysis and interpretation, HAR has become an increasingly relevant tool in smart surveillance. It can improve the protection and security of public areas and vital infrastructure.

HAR can recognize and classify human activities like walking, running, loitering, and even suspicious actions such as carrying weapons or goods. This system can detect anomalous or repetitive activity patterns, such as lingering in a closed area or leaving an object unattended, and send notifications to security officers.

Furthermore, HAR may identify people in real time, particularly in crowded locations, by assessing their gait, stance, and other physical traits, even if the face is concealed or covered. Such a system can also follow people throughout the surveillance area, allowing security officers to find and track prospective suspects. However, this raises privacy concerns, which must be addressed with suitable legislation and safeguards.

Human Activity Recognition datasets

Let’s review a few of HAR's most important ready-to-use datasets.

Pro tip: Looking for quality datasets to train your models? Check out our collection of 500+ open datasets.

Kinetics-700

A large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips covering 700 human action classes. The videos include human-object interactions, as well as human-human interactions. The Kinetics dataset is great for training human action recognition models.

Volleyball action recognition dataset

Volleyball is a video action recognition dataset. It has 4,830 annotated frames handpicked from 55 videos with nine player action labels and eight team activity labels. It contains group activity annotations as well as individual activity annotations.

ARID Dataset

Sample frames from the ARID dataset

The Action Recognition in the Dark (ARID) dataset is a benchmark dataset for action recognition in low-light conditions. It includes over 3,780 video clips featuring 11 action categories, making it the first dataset focused on human actions in dark videos. The ARID dataset is an important resource for researchers and practitioners working on improving action recognition algorithms in challenging lighting conditions.

DAHLIA - Daily Human Life Activity

The DAHLIA dataset is focused on human activity recognition for smart-home services, such as user assistance.

Videos were recorded in realistic conditions, with 3 Kinect v2 sensors located as they would be in a real context. The long-range activities were performed in an unconstrained way (participants received only a few instructions) and in a continuous (untrimmed) sequence, resulting in long videos (40 min on average per subject).

Human Activity Recognition Using Smartphones Data Set

The Human Activity Recognition Using Smartphones Data Set is a publicly available dataset that contains sensor readings from a smartphone's accelerometer and gyroscope captured during six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.

The dataset includes 3-axial linear acceleration and 3-axial angular velocity measurements captured at a constant rate of 50Hz. The sensor data was collected from 30 volunteers wearing a Samsung Galaxy S II smartphone on their waist while performing the activities. Each volunteer was asked to perform each activity for approximately 2-3 minutes, resulting in 10,299 instances.
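A quick calculation shows how recordings like these turn into training instances. The 50 Hz rate comes from the description above. The 2.56-second window and 50% overlap are not stated in this article and are taken here as assumptions based on the dataset's own documentation, so double-check them before relying on the numbers.

```python
SAMPLING_RATE_HZ = 50    # stated in the dataset description above
WINDOW_SECONDS = 2.56    # assumption: window length from the dataset documentation
OVERLAP = 0.5            # assumption: 50% overlap between consecutive windows

samples_per_window = round(SAMPLING_RATE_HZ * WINDOW_SECONDS)
step = int(samples_per_window * (1 - OVERLAP))


def windows_in(duration_seconds):
    """Number of complete windows that fit into a recording of this length."""
    n_samples = int(duration_seconds * SAMPLING_RATE_HZ)
    if n_samples < samples_per_window:
        return 0
    return (n_samples - samples_per_window) // step + 1


# A hypothetical 2.5-minute recording of one activity
windows_per_recording = windows_in(150)
```

Under these assumptions each window holds 128 samples per sensor axis, and a 2.5-minute recording yields 116 windows. That is how a few minutes of activity per volunteer can add up to thousands of instances.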

Final thoughts

Human Activity Recognition (HAR) is an intriguing technology with many applications. HAR recognizes and classifies human activities and movements using machine-learning techniques and sensors. It can transform various sectors, including healthcare, sports performance analysis, gaming, smart surveillance, and human/computer interaction.

Yet, to address ethical and privacy concerns, HAR systems must be developed and deployed responsibly and transparently. It is essential to ensure that the data used to train and evaluate HAR algorithms is representative, diverse, and unbiased.

This article also discussed how deep learning-based HAR approaches can outperform conventional machine learning algorithms, outlining RNN, LSTM, and CNN architectures.

In conclusion, HAR has the potential to alter our daily lives and have a significant beneficial effect on society as it evolves and improves.

References

  1. Arshad, M. H., Bilal, M., & Gani, A. (2022). Human Activity Recognition: Review, Taxonomy, and Open Challenges. Sensors, 22(17), 6463.

  2. Bhattacharya, D., Sharma, D., Kim, W., Ijaz, M. F., & Singh, P. K. (2022). Ensem-HAR: An ensemble deep learning model for smartphone sensor-based human activity recognition for measurement of elderly health monitoring. Biosensors, 12(6), 393.

  3. Gupta, N., Gupta, S. K., Pathak, R. K., Jain, V., Rashidi, P., & Suri, J. S. (2022). Human activity recognition in artificial intelligence framework: A narrative review. Artificial Intelligence Review, 55(6), 4755-4808.

  4. Jobanputra, C., Bavishi, J., & Doshi, N. (2019). Human activity recognition: A survey. Procedia Computer Science, 155, 698-703.

  5. Song, L., Yu, G., Yuan, J., & Liu, Z. (2021). Human pose estimation and its application to action recognition: A survey. Journal of Visual Communication and Image Representation, 76, 103055.

  6. Yao, Y. (n.d.). Human activity recognition is based on recurrent neural networks. Yu's Website. Retrieved March 3, 2023, from https://moonblvd.github.io/brianyao_hugo/project/lstm/

  7. Zeng, M., Nguyen, L. T., Yu, B., Mengshoel, O. J., Zhu, J., Wu, P., & Zhang, J. (2014, November). Convolutional neural networks for human activity recognition using mobile sensors. In 6th international conference on mobile computing, applications and services (pp. 197-205). IEEE.

Deval Shah

Deval is a senior software engineer at Eagle Eye Networks and a computer vision enthusiast. He writes about complex topics related to machine learning and deep learning.
