Human Activity Recognition (HAR) is an exciting research area in computer vision and human-computer interaction.
Automatic detection of human physical activity has become crucial in pervasive computing, interpersonal communication, and human behavior analysis.
The broad usage of HAR benefits human safety and general well-being. Health monitoring can be done through wearable devices tracking physical activity, heart rate, and sleep quality. In smart homes, HAR-based solutions allow for energy saving and personal comfort by detecting when a person enters or leaves a room and adjusting the lighting or temperature. Personal safety devices can automatically alert emergency services or a designated contact. And that’s just the tip of the iceberg.
With multiple publicly available datasets, finding ready-to-use data for study and development purposes is very simple.
In this post, you’ll learn more about HAR’s current state-of-the-art, along with deep learning methods and machine learning models best suited for the task.
Human Activity Recognition (HAR) is a branch of computational science and engineering that tries to create systems and techniques capable of automatically recognizing and categorizing human actions based on sensor data. It is the capacity to use sensors to interpret human body gestures or motion and determine human activity or movement.
HAR systems are typically supervised or unsupervised and can be used in various applications, including wellness, healthcare, security, and sports performance.
When modeling, the HAR system's objective is to predict the label of a person's action from an image or video, which is commonly done through video-based or image-based activity recognition.
Pose estimation is used by some of the most common vision-based HAR systems. Researchers employ it more and more frequently because it reveals essential information about human behavior.
This helps in tasks such as HAR, content extraction, and semantic comprehension. It makes use of various deep learning (DL) approaches, especially convolutional neural networks.
One of HAR’s biggest challenges is taking the physical attributes of humans, cultural markers, direction, and the type of poses into consideration. For example, let’s take a look at the image below. It may be hard to predict whether the person is falling or attempting a handstand. This uncertainty encourages the use of newer methods within the artificial intelligence framework.
Multi-modal learning and graph-based learning aim to improve the accuracy and robustness of HAR systems by incorporating more complex features, utilizing multiple data sources, and capturing the spatial and temporal relationships between body parts.
Some of the other HAR challenges include:
The human ability to perceive the activities of others is one of the critical objects of study in computer vision and machine learning. Here are the basic steps involved in every HAR task.
The data for HAR is usually acquired by sensors attached to or worn by the user. Standard HAR sensors include accelerometers, gyroscopes, magnetometers, and GPS sensors.
Accelerometers detect changes in movement and direction and measure acceleration along three axes (x, y, and z). Magnetometers sense magnetic fields and orientation, whereas gyroscopes measure rotation and angular velocity. GPS sensors can help track a user's location and movements, although they are less commonly used for HAR because of their substantial power consumption and limited indoor accuracy. Sensor data is usually captured as a time series, with each sample reflecting the sensor measurements at a specific point in time (e.g., every second).
Data preprocessing is an essential stage in Human Activity Recognition (HAR) since it cleans, transforms, and prepares raw sensor data for future analysis and modeling. Some standard preparation processes include:
Data preparation is a crucial stage in HAR since it affects the precision and dependability of activity identification models.
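As a rough illustration of what such preparation can look like, the sketch below applies two steps that are common in sensor-based HAR pipelines: per-channel z-score normalization and segmentation into fixed-length, overlapping windows. The function names, the synthetic readings, the 4-sample window, and the 50% overlap are all assumptions chosen for illustration, not values prescribed by this article.

```python
from statistics import mean, pstdev


def zscore_normalize(samples):
    """Scale each sensor channel to zero mean and unit variance.

    `samples` is a list of readings, one per time step, where each
    reading is a tuple such as (x, y, z) from an accelerometer.
    """
    channels = list(zip(*samples))  # transpose: one tuple per channel
    means = [mean(ch) for ch in channels]
    # Fall back to 1.0 for a constant channel to avoid dividing by zero.
    stds = [pstdev(ch) or 1.0 for ch in channels]
    return [
        tuple((value - m) / s for value, m, s in zip(reading, means, stds))
        for reading in samples
    ]


def sliding_windows(samples, window_size, step):
    """Split a time series into fixed-length, possibly overlapping windows."""
    return [
        samples[start:start + window_size]
        for start in range(0, len(samples) - window_size + 1, step)
    ]


# Hypothetical 3-axis accelerometer readings (x, y, z), one per time step.
raw = [(0.1 * t, 1.0, 9.8 - 0.05 * t) for t in range(10)]

normalized = zscore_normalize(raw)
# Windows of 4 samples with a step of 2, i.e. 50% overlap (assumed values).
windows = sliding_windows(normalized, window_size=4, step=2)
print(len(windows), "windows of", len(windows[0]), "samples each")
```

Each window can then be treated as one training example and labeled with the activity performed during that time span.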
Several machine learning algorithms may be used to recognize human activities. The choice should depend on data complexity, available resources, and performance criteria. Here are some popular HAR machine learning models:
Human Activity Recognition (HAR) systems are deployed using one of two methods:
HAR is a complex subject for study in the discipline of computer vision. Researchers worldwide have been working on constructing a near-perfect recognition system for a long time.
For example, a paper by J. Gao et al. compares the performance of deep learning algorithms (such as Convolutional Neural Networks and Recurrent Neural Networks) to classical machine learning methods (such as Support Vector Machines and Random Forests) in HAR tasks.
The study finds that deep learning algorithms outperform classical machine learning methods in terms of accuracy, robustness to variations in sensor data, and the ability to learn complex features automatically from raw data. The study also notes that deep learning algorithms can be computationally efficient and implemented on low-power devices for real-time HAR applications.
DL models can accommodate variations in sensor placement, orientation, and other environmental conditions that alter sensor signals, making them more resilient to real-world circumstances. DL models also scale well and can handle big datasets containing millions of observations, which is especially beneficial for HAR applications involving several sensors and multiple users.
Besides this, deep learning algorithms excel in processing time-series data to classify and extract features, leveraging local dependencies. Researchers are increasingly interested in using sophisticated deep learning approaches such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and hybrid models to recognize human activities better.
Furthermore, DL makes it possible to build end-to-end models that map sensor data directly to activity labels, eliminating the need for intermediate steps like segmentation and feature extraction.
Recurrent Neural Networks (RNNs) are a form of deep learning model that works well with sequential input, particularly in Human Activity Recognition situations where the input data is time-series data from sensors.
When RNNs are used for HAR, the input data is first turned into a sequence of fixed-length feature vectors, with each vector representing a time window of sensor data. The feature vector sequence is then passed into the RNN, which processes each input vector in turn while keeping a hidden state that retains the temporal connections between input vectors.
The ability of RNNs to detect long-term temporal dependencies in input data is their primary benefit for HAR. This is performed by employing recurrent connections between the RNN's hidden states. The recurrent connections let the RNN keep an internal recollection of prior inputs, which helps it recognize complicated patterns of activity that span numerous time frames.
RNNs have demonstrated encouraging results in HAR, with high accuracy and robustness in recognizing complex activities, such as athletic movements, home activities, and falls. They can also handle variable-length input sequences, making them well suited to practical uses where activity duration varies. Their limitations include the vanishing and exploding gradient problems, which can impair the training process.
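To make the role of the hidden state more concrete, here is a minimal sketch of the forward pass of a vanilla (Elman) RNN in plain Python. The layer sizes, the random weights, and the input sequences are assumptions for illustration only; a real HAR model would learn its weights from labeled data and would normally be built with a deep learning framework.

```python
import math
import random


def rnn_forward(sequence, w_in, w_rec, bias):
    """Run a vanilla RNN over a sequence and return the final hidden state.

    Update rule per time step: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)
    """
    hidden_size = len(bias)
    hidden = [0.0] * hidden_size  # initial hidden state h_0
    for x in sequence:
        # The comprehension reads the previous `hidden` before it is replaced.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(hidden_size)
        ]
    return hidden


def rand_matrix(rows, cols):
    """Small random weights; stand-ins for weights a model would learn."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
input_size, hidden_size = 3, 4  # assumed sizes
w_in = rand_matrix(hidden_size, input_size)
w_rec = rand_matrix(hidden_size, hidden_size)
bias = [0.0] * hidden_size

# Two hypothetical sequences of different lengths; each step is one
# 3-value feature vector computed from a window of sensor data.
short_seq = [[0.1, 0.2, 0.3]] * 5
long_seq = [[0.1, 0.2, 0.3]] * 12

h_short = rnn_forward(short_seq, w_in, w_rec, bias)
h_long = rnn_forward(long_seq, w_in, w_rec, bias)
print(len(h_short), len(h_long))
```

Both sequences yield a hidden state of the same size regardless of their length, which is what lets a classifier placed on top of the RNN handle activities of varying duration.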
Long Short-Term Memory (LSTM) is a form of Recurrent Neural Network (RNN) which has been effectively used for a variety of sequential data-related tasks, including Human Activity Recognition (HAR).
LSTM models, like other RNNs, are designed to analyze data sequences and save internal memories of prior inputs, enabling them to retain the temporal connections between different sections of the sequence.
The main benefit of LSTMs over standard RNNs is their capacity to selectively forget or retain information from previous time steps. This helps mitigate the vanishing gradient problem, which frequently occurs in regular RNNs. LSTMs can effectively model long-term dependencies inside the input sequence. They’re well-suited for complicated HAR tasks such as identifying anomalies and recognizing complex human actions.
LSTM-based models have demonstrated significant gains in HAR tasks on various benchmark datasets, attaining state-of-the-art performance. They have also shown resilience in detecting complicated activities and dealing with variable-length input sequences. However, just like other models based on deep learning, LSTMs have several drawbacks for HAR: the requirement for vast volumes of labeled data, high computational cost, and limited model interpretability.
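The gating idea can be sketched with a single LSTM unit using the standard cell equations. The code below is a simplified illustration with scalar inputs and hand-picked (not learned) weights; the parameter dictionaries `remember` and `forget` are hypothetical and exist only to show how the forget and input gates control what the cell state keeps.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single hidden unit (scalars for clarity).

    `params` maps each gate name to a tuple (w_x, w_h, b).
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write

    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state (output)
    return h, c


def run(params, inputs, c0):
    h, c = 0.0, c0
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return h, c


# Assumed weights: a large forget bias keeps the cell state, and a very
# negative input bias blocks new writes.
remember = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}
# Same weights, but with a very negative forget bias that erases the state.
forget = dict(remember, forget=(0.0, 0.0, -10.0))

inputs = [0.5, -0.3, 0.8, 0.1] * 25  # 100 hypothetical time steps
_, c_kept = run(remember, inputs, c0=1.0)
_, c_erased = run(forget, inputs, c0=1.0)
print(round(c_kept, 3), round(c_erased, 3))
```

With the forget gate held open, the value stored in the cell state survives 100 time steps almost unchanged, while the second configuration discards it almost immediately. In a trained LSTM these gate values are learned and depend on the input at each step.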
Convolutional Neural Networks (CNNs) are a deep learning architecture that excels at processing image and video data. CNNs have been utilized in the setting of Human Activity Recognition (HAR) to automatically and reliably detect and classify human actions from sensor data.
The input data for HAR utilizing CNNs is often time-series data acquired by sensors. The time-series data is first transformed into a 2D image-like format, with time as the x-axis and sensor data as the y-axis.
The generated data matrix is then fed into the CNN for feature extraction and classification. Using a sliding window technique, the CNN's convolutional layers apply filters to the incoming data. At each position in the input, a filter extracts a particular feature, such as an edge or corner.
The output of the convolutional layers is then passed into the pooling layers, which downsample the extracted features while maintaining their crucial spatial correlations. The pooling layers' output is then flattened and passed into fully connected layers that classify the extracted features into distinct human activities. The output of the fully connected layers is then fed into a softmax function, which generates a probability distribution over the various activities.
The image below, taken from this paper, gives us an idea of how CNN’s basic framework works.
CNNs have the advantage of handling input of different sizes and shapes, making them well suited to interpreting sensor data from various devices. Furthermore, CNNs can learn hierarchical feature representations of the input data, allowing them to capture both the low-level and high-level features essential to human activity recognition.
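The convolution, pooling, and softmax stages described above can be sketched in a few lines of plain Python. This is a toy forward pass under stated assumptions: the two-channel input window, the single hand-picked "rising edge" filter, the dense weights, and the activity labels are all hypothetical, and a real model would learn many filters from data using a deep learning framework.

```python
import math


def conv1d(signal, kernel):
    """Slide `kernel` along the time axis of `signal` (no padding, stride 1).

    signal: list of time steps, each a list of channel values
    kernel: list of kernel_size rows, each a list of per-channel weights
    """
    k = len(kernel)
    return [
        sum(
            w * v
            for row_w, row_v in zip(kernel, signal[t:t + k])
            for w, v in zip(row_w, row_v)
        )
        for t in range(len(signal) - k + 1)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Keep the largest value in each non-overlapping block of `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical window: 8 time steps x 2 sensor channels.
window = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0],
          [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
# Hand-picked filter of width 3 that responds to a rising signal.
rising_edge = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]

features = max_pool(relu(conv1d(window, rising_edge)), size=2)

# Assumed fully connected layer: one weight row per activity class.
activities = ["walking", "sitting"]
dense = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
logits = [sum(w * f for w, f in zip(row, features)) for row in dense]
probs = softmax(logits)
print(dict(zip(activities, probs)))
```

The filter fires where the signal rises, pooling shortens the feature map, and softmax turns the dense layer's scores into a probability for each activity.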
Human Activity Recognition is already used in multiple fields, with new applications appearing all the time. Let’s go through a few flagship examples.
Human Activity Recognition (HAR) can analyze sports performance in various ways. It may be utilized to track and analyze athletes' movements during competition and training, anticipate new injury risks, assess the effectiveness of different training programs, follow individual athletes' growth, and examine team sports' tactical and strategic components.
For example, HAR can be used to analyze badminton players' movements while they attempt shots and smashes, track runners' movements and identify possible overuse injuries, monitor soccer players' performance during a game, track tennis players' movements throughout a match and identify areas for improving footwork and positioning, or analyze basketball players' actions during a game to find opportunities to improve team defense and ball movement.
Human Activity Recognition (HAR) has numerous uses in self-driving cars. HAR may be employed to detect people and other vehicles on the road, increasing the effectiveness and safety of self-driving automobiles.
For example, HAR may be used to identify and monitor the motions of pedestrians, cyclists, and other vehicles in the environment, allowing self-driving cars to predict and prevent collisions.
HAR can also recognize driver behavior, such as hand signals and head movements, which can help self-driving cars communicate with human drivers.
Human Activity Recognition can be used to identify and classify human gestures and movements, which can be utilized to improve computer system usability and accessibility.
HAR can be used to enable gesture-based commands of electronic devices like smartphones and smart TVs, resulting in an even more natural and easily understood user interface. HAR can also provide voice-based automation of computer systems, such as virtual personal assistants and chatbots, allowing for more practical and effective communication with computers.
Furthermore, HAR can monitor computer users' health and wellness by identifying and categorizing their physical movements and behaviors, which can help prevent and reduce the harmful impacts of prolonged computer use, including eye strain, back pain, etc.
Human Activity Recognition has several uses in the gaming industry. HAR can recognize and classify various player actions and gestures, allowing for more immersive and participatory gaming experiences.
For instance, HAR may enable motion-controlled gaming, translating the player's movements and gestures into in-game activities such as swinging a sword or throwing a ball. HAR can also provide gesture-based manipulation of in-game panels and settings, making navigating the game more convenient and intuitive.
Furthermore, HAR can track a player's physical exercise and motions while playing. A game, for example, may reward the player for completing a certain number of steps or performing a particular workout.
As it permits automatic video analysis and interpretation, HAR has become an increasingly relevant tool in smart surveillance. It can improve the protection and security of public areas and vital infrastructure.
HAR can recognize and classify human activities like walking, running, loitering, and even suspicious actions such as carrying weapons or goods. This system can detect anomalous or repetitive activity patterns, such as lingering in a closed area or leaving an object unattended, and send notifications to security officers.
Furthermore, HAR can identify people in real time, particularly in crowded locations, by assessing their gait, posture, and other physical traits, even if the face is concealed or covered. Such a system can also follow people throughout the surveillance area, allowing security officers to find and track prospective suspects. However, this raises privacy concerns, which must be addressed with suitable legislation and protections.
Let’s review a few of HAR's most important ready-to-use datasets.
Kinetics is a large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips covering 700 human action classes. The videos include human-object interactions, as well as human-human interactions. The Kinetics dataset is great for training human action recognition models.
Volleyball is a video action recognition dataset. It has 4,830 annotated frames handpicked from 55 videos with nine player action labels and eight team activity labels. It contains group activity annotations as well as individual activity annotations.
The Action Recognition in the Dark (ARID) dataset is a benchmark dataset for action recognition in low-light conditions. It includes over 3,780 video clips featuring 11 action categories, making it the first dataset focused on human actions in dark videos. The ARID dataset is an important resource for researchers and practitioners working on improving action recognition algorithms in challenging lighting conditions.
The DAHLIA dataset is focused on human activity recognition for smart-home services, such as user assistance.
Videos were recorded in realistic conditions, with 3 Kinect v2 sensors located as they would be in a real context. The long-range activities were performed in an unconstrained way (participants received only a few instructions) and in a continuous (untrimmed) sequence, resulting in long videos (40 min on average per subject).
The Human Activity Recognition Using Smartphones Data Set is a publicly available dataset that contains sensor readings from a smartphone's accelerometer and gyroscope captured during six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.
The dataset includes 3-axial linear acceleration and 3-axial angular velocity measurements captured at a constant rate of 50Hz. The sensor data was collected from 30 volunteers wearing a Samsung Galaxy S II smartphone on their waist while performing the activities. Each volunteer was asked to perform each activity for approximately 2-3 minutes, resulting in 10,299 instances.
Human Activity Recognition (HAR) is an intriguing technology with many applications. HAR recognizes and classifies human activities and movements using machine-learning techniques and sensors. It can transform various sectors, including healthcare, sports performance analysis, gaming, intelligent monitoring, and human-computer interaction.
Yet, to address ethical and privacy concerns, HAR must be developed and deployed responsibly and transparently. It is essential to ensure that the data used to train and assess HAR algorithms is representative, diverse, and unbiased.
The article also discussed how deep learning-based HAR approaches outperform conventional machine learning algorithms and outlined how RNNs, LSTMs, and CNNs are applied to the task.
In conclusion, HAR has the potential to alter our daily lives and have a significant beneficial effect on society as it evolves and improves.