Timely identification and treatment of medical conditions can facilitate faster recovery and better health. Existing systems address this problem with custom-built sensors, which are invasive and difficult to generalize. A low-complexity, scalable process is proposed to detect and identify medical conditions from 2D skeletal movements in video feed data. A minimal set of features relevant to distinguishing medical conditions, namely AMF, PVF and GDF, is derived from skeletal data on frames sampled across the entire action. The angular motion features (AMF) capture the angular motion of limbs during a specific action. The positional variation features (PVF) represent the relative position of joints. The global displacement features (GDF) identify the direction of overall skeletal movement. The discriminative capability of these features is illustrated by their variance across time for different actions. The classification of medical conditions is approached in two stages. In the first stage, a low-complexity binary LSTM classifier is trained to distinguish visual medical conditions from general human actions. In the second stage, a multi-class LSTM classifier is trained to identify the exact medical condition from a given set of visually interpretable medical conditions. The proposed features are extracted from the 2D skeletal data of NTU RGB + D and then used to train the binary and multi-class LSTM classifiers. The binary and multi-class classifiers achieved average F1 scores of 77% and 73%, respectively, while the overall system produced an average F1 score of 69% and a weighted average F1 score of 80%. The multi-class classifier is found to use 10 to 100 times fewer parameters than existing 2D CNN-based models while producing similar levels of accuracy.
The onset of fatal diseases such as cardiac arrest or brain stroke could start with relatively milder symptoms such as a headache or chest pain. These symptoms are likely to go unnoticed, either due to a shortage of medical professionals or due to the ignorance or carelessness of the patients. This risk is further exacerbated when the patient is alone and does not have frequent human contact. Early diagnosis of these symptoms could result in timely treatment and faster recovery in patients. Given this context, automated detection of medical conditions without human intervention could notify medical professionals and significantly facilitate timely medical help.
Recent technological advances have helped identify medical conditions [
Existing medical condition identification systems are developed considering the medical condition to be diagnosed and the data source sensors. A category of these systems [
Besides sensor-based systems, medical condition identification based on video camera inputs and depth maps have been employed in various cases, such as for elderly help, fall detection, depression detection [
RGB sequences refer to data captured from a video camera where three 2D-Matrices R, G and B are available for every frame in the video sequence. Models related to 3D-CNN [
This paper focuses on developing a low-complexity, highly interpretable process to identify medical conditions from 2D skeletal data. In line with this objective, the following contributions are made:
A sampling procedure that uses skeletal data at specific time instances to compensate for variation in action duration across different subjects and instances.
Three categories of skeletal features representing the actions associated with medical conditions, namely angular motion features (AMF), positional variation features (PVF) and global displacement features (GDF). These features are validated on the NTU (RGB) skeletal dataset to show their discriminative capability.
A 2-stage classifier. The first stage identifies whether a given video sequence represents a medical condition. The second stage classifies the medical condition. The results are validated using the NTU (RGB) skeletal dataset.
The rest of the paper is organized as follows. Section 2 details the overall problem to be solved in conjunction with the scope of this work. The different types of features extracted from a temporal and spatial perspective are elaborated on in Section 3, along with their respective validation. Section 4 describes and validates the classifiers trained with the derived skeletal features for medical condition detection and identification. Section 5 captures the experimental results and comparison with other approaches. A summary of this work and its future developments are mentioned in Section 6.
In this section, the medical condition detection and identification system, as defined in
The 2D Skeletal data consists of 25 joints per person per frame represented by their (x, y) coordinates which translates to 25 * 2 = 50 features per frame. As could be noted, the feature size is significantly less when compared to the feature space of the RGB data, which for a standard resolution frame is 640 * 480 * 3 = 921600. This results in lower model complexity, fewer data samples for model training and a better representation of action without involving background, texture and other covariates.
This paper proposes three significant contributions to the medical condition identification system defined in
A medical condition such as falling or a headache is usually accompanied by sudden changes in the movement of elbows, knees and other regions in the 2D skeletal data. Representative features are explicitly derived to detect and identify medical conditions. These features are reasonably interpretable and align closely with skeletal movement during medical conditions.
Three categories of derived features are computed, namely angular motion features (AMF) to represent the angular variation in joints, positional variation features (PVF) to capture variation in relative position and global displacement features (GDF) to capture the movement of the entire body. These features are elaborated in Section 3 and rigorously validated on the NTU (RGB) [
(a) Volume and type of data
Description | Value |
---|---|
No. of medical condition related actions | 9 |
No. of day-to-day actions | 51 |
Total no. of unique actions | 60 |
No. of samples per action | 948 |
Total no. of samples | 56880 |
(b) Medical condition data
List of medical condition related actions | |
---|---|
Cough/Sneeze | Back pain |
Staggering | Neck pain |
Falling | Nausea |
Headache | Fan self |
Chest pain | |
The objective of the medical condition detector is to process incoming video sequences and notify when a potential medical condition has occurred. The features derived from 2D skeletal data are used to identify the occurrence of these medical conditions. The derived features are computed at the frame level and aggregated over an action, resulting in multi-dimensional time-series data. Different types of time series classifiers, such as distance [
Considering the above factors, the LSTM [
The medical condition identifier categorizes the exact nature of the condition after successful detection. Derived 2D skeletal features are used to train the multi-class LSTM model to identify the medical condition. Data samples of each medical condition are collated and their respective 2D skeletal features are computed. This data is then used to train a multi-class LSTM where each class represents features derived from its corresponding medical condition. The medical condition identifier and its validation are detailed in Section 4.
This section focuses on choosing a minimal, interpretable set of features from 2D skeletal data to distinguish medical actions across space and time. From a time-based perspective, the video is sampled at a frequency chosen according to the duration of the action. In terms of space, we use three broad types of features: angular motion features (AMF), positional variation features (PVF) and global displacement features (GDF).
Based on observations, a single action takes about four to ten seconds, depending on the speed at which a person performs the action associated with the medical condition. Videos are generally encoded at 30 fps, so each action is represented by between 120 and 300 frames. Note that immediately successive frames add little new information, since the human body does not change position significantly in 1/30th of a second. Hence, it is essential to retain only the informative frames for further processing.
The duration of a specific medical condition-associated action can vary with the test subject and across different instances for the same subject. When the duration is shorter, consecutive frames contain more information and the video needs to be sampled at a higher frequency. Similarly, for longer actions, successive frames carry less information and it is generally acceptable to sample at a lower frequency. Given this, we propose an approach where the number of frames encoded in an action is fixed as a constant (K), based on which the frequency of video sampling
To compute the sampling frequency, the time steps of interest are calculated using
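The fixed-length sampling idea can be sketched as follows. The exact expressions for the sampling frequency and the time steps are not reproduced in this text, so the sketch assumes K evenly spaced frame indices spanning the whole action. The function name `sample_frame_indices` and the value K = 20 used in the example are illustrative assumptions, not the authors' choices.

```python
from typing import List


def sample_frame_indices(num_frames: int, k: int) -> List[int]:
    """Return k frame indices spread evenly over an action of num_frames frames.

    Assumption: uniform spacing from the first frame to the last frame.
    If num_frames < k, some indices repeat.
    """
    if num_frames < 1 or k < 1:
        raise ValueError("num_frames and k must be positive")
    if k == 1:
        return [0]
    step = (num_frames - 1) / (k - 1)  # spacing between sampled frames
    return [round(i * step) for i in range(k)]


# A 4-second and a 10-second action at 30 fps both reduce to K frames.
short_action = sample_frame_indices(120, 20)
long_action = sample_frame_indices(300, 20)
```

Under this assumption a shorter action is sampled with a smaller stride (higher frequency) and a longer action with a larger stride, while both yield the same number of time steps for the classifier.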
Medical conditions generally have specific characteristics that can be used to select the right features. For instance, the action involved is particular to the person and does not involve any additional object or interaction with other people. Additionally, the background and locality have little relevance to the nature of the medical condition. Based on these characteristics, domain-specific custom features are derived from skeletal data and presented below:
Any occurrence of a medical condition should invariably result in the movement of different limbs of the human body. These variations are captured by the change in the angle at a joint (such as the elbow or knee) formed by two adjacent limbs. The pattern of variation of these angles is quite sensitive to, and representative of, the medical condition. These angles are invariant to the video’s size and the morphological dimensions of the human performing the action. For every joint of interest, we form a triangle with the joint and the two adjacent points given in
S. No | Central joint angle | Adjacent points |
---|---|---|
1 | ∠ Left hip | Left knee, Hip center |
2 | ∠ Right hip | Right knee, Hip center |
3 | ∠ Left knee | Left ankle, Left hip |
4 | ∠ Right knee | Right ankle, Right hip |
5 | ∠ Left elbow | Left shoulder, Left wrist |
6 | ∠ Left shoulder | Left elbow, Neck |
7 | ∠ Right elbow | Right shoulder, Right wrist |
8 | ∠ Right shoulder | Right elbow, Neck |
9 | ∠ Shoulder center | Head, Left shoulder |
10 | ∠ Hip center | Chest Mid, Left hip |
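As a minimal sketch of how one such joint angle could be computed from 2D coordinates, the snippet below takes the central joint and its two adjacent points and returns the angle between the two limb vectors. The dot-product formulation, the output in degrees, and the helper name `joint_angle` are assumptions made for illustration; the text does not state the exact formula used.

```python
import math


def joint_angle(center, a, b):
    """Angle in degrees at `center` between the segments center->a and center->b.

    Each argument is an (x, y) pair. Assumption: a degenerate limb of zero
    length yields an angle of 0.0.
    """
    v1 = (a[0] - center[0], a[1] - center[1])
    v2 = (b[0] - center[0], b[1] - center[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding drift
    return math.degrees(math.acos(cos_theta))


# Example: left elbow angle from the left shoulder and left wrist positions.
left_elbow = joint_angle(center=(0.0, 0.0), a=(1.0, 0.0), b=(0.0, 1.0))
```

Because only the directions of the two limb vectors enter the result, scaling all coordinates by a common factor leaves the angle unchanged, which matches the invariance property noted above.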
The AMF features constitute a 10-dimensional time series capturing the variation in the joint angles across time. For every action, the average time series across different samples (also called a barycenter) is computed for each of the ten joint angles. The variance of the barycenter [
Medical condition | Left hip | Right hip | Left knee | Right knee | Left elbow | Left shoulder | Right elbow | Right shoulder | Head | Chest mid |
---|---|---|---|---|---|---|---|---|---|---|
Sneeze/Cough | 0.17 | 1.12 | 2.47 | 2.21 | 1472.21 | 45.54 | 339.72 | 33.01 | 14.96 | 0.46 |
Staggering | 1.13 | 1.66 | 41.74 | 33.379 | 5.5 | 1.16 | 2.87 | 0.66 | 4.81 | 0.22 |
Falling | 59.91 | 41.07 | 599.67 | 572.29 | 27.03 | 23.46 | 32.5 | 30.75 | 36.89 | 4.96 |
Headache | 0.12 | 0.4 | 1.49 | 1.61 | 1192.01 | 78.22 | 539.1 | 48.24 | 16.79 | 0.05 |
Chest pain | 0.23 | 0.49 | 2.53 | 2.27 | 431.2 | 4.15 | 266.97 | 2.05 | 18.23 | 0.63 |
Back pain | 0.5 | 0.51 | 2.34 | 1.93 | 150.61 | 3.4 | 130.7 | 2.56 | 7.16 | 0.07 |
Neck pain | 0.53 | 0.33 | 4.04 | 4.29 | 868.23 | 93.42 | 367.51 | 24.97 | 0.63 | 0.08 |
Nausea | 3.27 | 3.73 | 51.02 | 49.04 | 1164.07 | 159.96 | 766.42 | 135.95 | 53.43 | 0.83 |
Fan self | 0.66 | 0.63 | 4.55 | 4.48 | 887.58 | 4.69 | 270.39 | 2.63 | 2.88 | 0.06 |
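The barycenter variance reported in the table above can be illustrated with a short sketch. It assumes a pointwise (Euclidean) average across samples and a population variance over the sampled time steps; the barycenter method cited in the text may differ, and the function names are hypothetical.

```python
from statistics import fmean, pvariance


def barycenter(series_list):
    """Pointwise mean of equal-length time series (one series per sample)."""
    return [fmean(values) for values in zip(*series_list)]


def barycenter_variance(series_list):
    """Variance across time of the averaged series for one joint angle."""
    return pvariance(barycenter(series_list))


# Two toy samples of one joint angle over three sampled frames (degrees).
samples = [[0.0, 10.0, 20.0], [0.0, 20.0, 40.0]]
variance = barycenter_variance(samples)
```

A joint that barely moves during an action gives a nearly flat barycenter and a variance close to zero, while a joint that swings widely gives a large value, which is how the table separates, for example, elbow-dominated actions from knee-dominated ones.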
Action | All features | AMF features | PVF features | GDF feature |
---|---|---|---|---|
Sneeze/Cough | 0.65 | 0.56 | 0.4 | 0.28 |
Staggering | 0.91 | 0.8 | 0.65 | 0.69 |
Falling | 0.94 | 0.89 | 0.74 | 0.81 |
Headache | 0.52 | 0.48 | 0.39 | 0.25 |
Chest pain | 0.65 | 0.53 | 0.38 | 0.22 |
Back pain | 0.76 | 0.64 | 0.44 | 0.17 |
Neck pain | 0.62 | 0.48 | 0.4 | 0.2 |
Nausea | 0.75 | 0.73 | 0.64 | 0.51 |
Fan self | 0.64 | 0.62 | 0.33 | 0.29 |
Weighted average | 0.72 | 0.64 | 0.48 | 0.38 |
Along with the motion of limbs in the human body, the positions of joints change relative to the reference during a medical condition. The relative position features capture the orientation of the nine different joints in the human body from the chest-mid region. The chest-mid region is closer to the body’s center and is considered the reference or centroid. These features are computed across time for the different sampled frames of interest. The angular direction of each joint of interest from the reference for a given frame is captured as part of the positional variation features (PVF), as illustrated in
S. No | Positional variation feature |
---|---|
1 | Centroid - Left hip |
2 | Centroid - Right hip |
3 | Centroid - Left knee |
4 | Centroid - Right knee |
5 | Centroid - Left elbow |
6 | Centroid - Left shoulder |
7 | Centroid - Right elbow |
8 | Centroid - Right shoulder |
9 | Centroid - Head |
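A minimal sketch of the positional variation features for a single frame is given below. It assumes the angular direction is measured with `atan2` from the chest-mid reference to each joint and reported in degrees in the range [0, 360); the angle convention, the function names and the joint dictionary are illustrative assumptions.

```python
import math


def direction_deg(reference, point):
    """Direction of `point` as seen from `reference`, in degrees in [0, 360).

    Assumption: angles are measured from the positive x axis using the raw
    coordinate axes of the skeletal data.
    """
    dx = point[0] - reference[0]
    dy = point[1] - reference[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def positional_variation_features(centroid, joints):
    """Map each joint name to its direction from the centroid for one frame."""
    return {name: direction_deg(centroid, xy) for name, xy in joints.items()}


frame_joints = {"head": (0.0, 1.0), "left_elbow": (-1.0, 0.0)}
pvf = positional_variation_features((0.0, 0.0), frame_joints)
```

Applying this to every sampled frame gives one value per joint per time step, so the nine joints in the table yield a 9-dimensional time series for each action.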
Overall, for an action having
Medical condition | Left hip | Right hip | Left knee | Right knee | Left elbow | Left shoulder | Right elbow | Right shoulder | Head |
---|---|---|---|---|---|---|---|---|---|
Sneeze/Cough | 0.24 | 0.24 | 2.92 | 3.13 | 46.6 | 6.53 | 14.51 | 3.3 | 22.97 |
Staggering | 2.02 | 1.85 | 47.05 | 44.47 | 74.64 | 104.41 | 78.71 | 107.53 | 112.57 |
Falling | 0.97 | 0.95 | 38 | 35.99 | 48.55 | 103.62 | 45.87 | 124.81 | 218.42 |
Headache | 0.28 | 0.29 | 1.18 | 1.33 | 426.19 | 2.88 | 203.63 | 1.97 | 3.73 |
Chest pain | 0.16 | 0.18 | 3.44 | 3.16 | 36.6 | 8.17 | 31.07 | 12.45 | 44.14 |
Back pain | 0.23 | 0.23 | 0.49 | 0.94 | 144.7 | 2.8 | 117.2 | 1.69 | 0.49 |
Neck pain | 0.43 | 0.42 | 0.86 | 0.55 | 582.2 | 6.76 | 204.11 | 4.78 | 4.09 |
Nausea | 0.45 | 0.47 | 0.94 | 12.4 | 30.86 | 124.67 | 26.25 | 120.71 | 320.38 |
Fan self | 0.16 | 0.15 | 1.5 | 0.88 | 142.94 | 3.97 | 55.93 | 4.17 | 3.4 |
Additionally, the PVF features are used to train a classifier to distinguish the nine different medical actions in the NTU RGB + D skeletal dataset, and the associated F1 score is shared in
During a medical condition, apart from the motion of joints and limbs in a skeleton, the position of the entire human body can vary across time. To capture this variation, the global displacement features are extracted to model the direction in which the human skeleton shifts over time in the sampled frames of skeletal data. These features are helpful when the person moves over the course of the action. The direction of each centroid in subsequent frames relative to the centroid region in the first frame is computed as the global displacement feature (GDF). This process is illustrated in
Medical condition | Sneeze/Cough | Staggering | Falling | Headache | Chest pain | Back pain | Neck pain | Nausea | Fan self |
---|---|---|---|---|---|---|---|---|---|
Body centroid variation | 33.36 | 2,878.51 | 3,851.28 | 13.51 | 50.84 | 21.18 | 11.29 | 424.99 | 9.82 |
Additionally, the GDF feature is used to train a classifier to distinguish the nine different medical actions in the NTU RGB + D skeletal dataset, and the associated F1 score is shown in
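The global displacement feature described in this section can be sketched as the direction of the body centroid in each later sampled frame relative to the centroid in the first frame. As with the other sketches, the `atan2` convention, the degree range [0, 360) and the function name are assumptions for illustration.

```python
import math


def global_displacement_features(centroids):
    """Direction (degrees) from the first-frame centroid to each later centroid.

    `centroids` is a list of (x, y) pairs, one per sampled frame, so an action
    with K sampled frames yields K - 1 values. Assumption: a frame with no
    displacement maps to 0.0.
    """
    x0, y0 = centroids[0]
    return [
        math.degrees(math.atan2(y - y0, x - x0)) % 360.0
        for x, y in centroids[1:]
    ]


# Toy trajectory of the centroid over four sampled frames.
gdf = global_displacement_features([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 2.0)])
```

For actions such as falling or staggering the centroid travels far from its starting point, so this series changes markedly over time, whereas for actions performed in place it stays nearly constant.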
As per the proposed framework, the incoming RGB video feeds from a commercial camera are used to extract 2D skeletal data with pose estimation modules such as OpenPose. The derived features elaborated in Section 3 are computed from the 2D skeletal data to distinguish and identify medical conditions. As shown in
The first stage involves the development of a binary classifier to distinguish a potential medical condition from other day-to-day actions. By the end of this stage, timely notifications could be provided to inform the concerned systems/people that a possible medical condition has occurred. After detecting the occurrence of a medical condition in stage 1, a multi-class classifier in stage 2 is used to identify the medical condition. The multi-class classifier is trained with derived skeletal features about each medical condition such that each class corresponds to a specific medical condition.
Sections 4.1 and 4.2 elaborate on developing the classifiers for detecting and identifying medical conditions. In Section 4.3, the effectiveness of the 2-stage classifier is analyzed.
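The control flow of the 2-stage process can be sketched as below. The trained binary and multi-class LSTM models are replaced by plain callables, and the names `two_stage_predict`, `detector` and `identifier` are hypothetical stand-ins rather than the authors' code.

```python
DAY_TO_DAY = "day-to-day action"


def two_stage_predict(features, detector, identifier):
    """Run stage 1 (detection) and, only if it fires, stage 2 (identification).

    detector(features) -> bool: True when a medical condition is suspected.
    identifier(features) -> str: name of the medical condition.
    """
    if not detector(features):
        return DAY_TO_DAY  # filtered out by the stage 1 binary classifier
    return identifier(features)  # stage 2 names the specific condition


# Stand-in models for illustration only: a threshold and a fixed label.
toy_detector = lambda feats: max(feats) > 100.0
toy_identifier = lambda feats: "falling"

calm = two_stage_predict([1.0, 2.0, 3.0], toy_detector, toy_identifier)
alarm = two_stage_predict([5.0, 600.0, 40.0], toy_detector, toy_identifier)
```

Keeping the stages separate means a notification can be raised as soon as stage 1 fires, before the more detailed identification in stage 2 completes.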
The medical condition detection system aims to classify a short action video as a potential medical condition. Medical condition data in NTU (RGB) [
The derived features described in Sections 3.2 to 3.4 are computed on these data samples and then used to train a binary LSTM classifier. Stochastic learning methods such as Adam Optimizer [
The medical condition detector achieves a macro-average F1 score of 0.77. The confusion matrix and performance metrics of the classifier are shown in
Label | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Day-to-day actions | 0.9 | 0.88 | 0.89 | 5497 |
Medical conditions | 0.64 | 0.67 | 0.66 | 1704 |
Accuracy | | | 0.83 | 7201 |
Macro average | 0.77 | 0.78 | 0.77 | 7201 |
Weighted average | 0.84 | 0.83 | 0.84 | 7201 |
As part of the 2-stage process to identify the occurrence of a medical condition, the second stage involves the development of a multi-class classifier to determine the exact medical condition that has occurred. This classifier is only utilized on video data already detected as a potential medical condition by the stage 1 binary classifier. The nine different medical conditions present in the NTU (RGB) dataset [
An LSTM-based model is used to train the multi-class classifier, using the Adam optimizer for similar reasons. Hyperparameters such as the batch size, the number of LSTM units and the number of epochs are selected after observing the training and validation accuracy curves. This process is detailed in Section 5.
Across the nine medical actions available in the NTU RGB + D skeletal dataset, a macro-average F1 score of 0.73 was achieved. The performance metrics and the confusion matrix for the LSTM trained with the best configuration are presented in
Label\Metrics | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Sneeze/Cough | 0.63 | 0.6 | 0.62 | 190 |
Staggering | 0.91 | 0.89 | 0.9 | 188 |
Falling | 0.93 | 0.97 | 0.95 | 188 |
Headache | 0.64 | 0.58 | 0.61 | 190 |
Chest pain | 0.69 | 0.58 | 0.63 | 190 |
Back pain | 0.67 | 0.82 | 0.74 | 189 |
Neck pain | 0.64 | 0.61 | 0.63 | 190 |
Nausea | 0.71 | 0.83 | 0.77 | 189 |
Fan self | 0.72 | 0.67 | 0.7 | 190 |
Accuracy | | | 0.73 | 1704 |
Macro average | 0.73 | 0.73 | 0.73 | 1704 |
Weighted average | 0.73 | 0.73 | 0.73 | 1704 |
The 2-stage classifier processes day-to-day actions more often than medical conditions, which rarely occur. Most of these day-to-day actions are filtered by the stage 1 binary classifier, and only the actions detected as medical conditions are passed to the stage 2 medical condition identifier. This process results in the improvement of overall accuracy. The processing of test data with actual labels and the number of samples across the 2-stage classifier is explained in
The final classes that are identified are the day-to-day actions and the specific medical conditions. The test data samples classified in different categories are listed in
Actual/Predicted | Day-to-day actions | Medical conditions correctly identified | Medical conditions incorrectly identified |
---|---|---|---|
Day-to-day actions | 4876 | 0 | 637 |
Medical condition | 567 | 837 | 310 |
Label | Precision | Recall | F1-score |
---|---|---|---|
Day-to-day actions | 0.90 | 0.88 | 0.89 |
Medical condition | 0.47 | 0.49 | 0.48 |
Macro average | 0.69 | 0.69 | 0.69 |
Weighted average | 0.80 | 0.79 | 0.80 |
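The per-class figures in the table above can be related to the sample counts of the 2-stage classifier with a short calculation. The mapping of counts to true positives, false positives and false negatives below is one reading of the counts table (an assumption), chosen because it reproduces the reported precision, recall and F1 values up to rounding.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the table of test samples for the 2-stage classifier.
day_correct, day_as_medical = 4876, 637
med_as_day, med_correct, med_wrong_condition = 567, 837, 310

# Day-to-day class: medical samples predicted as day-to-day are false positives.
day = precision_recall_f1(tp=day_correct, fp=med_as_day, fn=day_as_medical)

# Medical class (assumed reading): a sample counts as correct only when the
# exact condition is identified; wrong-condition predictions count as both a
# false positive and a false negative.
medical = precision_recall_f1(
    tp=med_correct,
    fp=day_as_medical + med_wrong_condition,
    fn=med_as_day + med_wrong_condition,
)

print([round(v, 2) for v in day])      # about 0.90, 0.88, 0.89
print([round(v, 2) for v in medical])  # about 0.47, 0.49, 0.48
```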
The performance evaluation of the medical condition detection and identification classifiers is detailed in Section 5.1. The results are compared to related work in Section 5.2.
The binary and multi-class classifier’s hyperparameters are tuned by evaluating different configurations. The train and test data are segregated using an 80:20 stratified split for both classifiers. A dropout of 50% is introduced into the classifier to prevent overfitting. The batch size and the number of LSTM units are determined by observing the train and test accuracy curves shown in
We select batch size 16 and LSTM with 200 units as the ideal configuration for the binary classifier after observing the saturation of accuracy post 200 LSTM units and the accuracy curves being more stable in this configuration. The confusion matrix and performance metrics for the binary classifier are shared in
Similarly, for the multi-class classifier used to identify medical conditions, a batch size of 8 and an LSTM with 200 units are selected as the ideal configuration after observing the saturation of accuracy beyond 200 LSTM units and the accuracy curves being more stable with a batch size of 8. The confusion matrix and performance metrics for the multi-class classifier are captured in
The multi-class classifier works very well in identifying conditions such as falling and staggering, where the F1 score is 0.9 or higher, as shown in
Thus, the classifiers built for detecting and identifying medical conditions have shown the capability to distinguish actions on test data, despite the minimal features used.
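The 80:20 stratified split used for both classifiers can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions (per-class shuffling with a fixed seed, rounding of the per-class test size); the function name `stratified_split` is hypothetical and the authors' tooling is not specified in the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps the same train/test proportion.

    Returns (train_indices, test_indices), both sorted.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["falling"] * 10 + ["headache"] * 5
train_idx, test_idx = stratified_split(labels)
```

With 948 samples per action, a 20% share under this rounding is 190 samples, which is in line with the per-class support of roughly 190 reported for the multi-class classifier.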
In this section, the accuracy and complexity of our system are compared with existing work. To compare the complexity of this system, the number of parameters present in the binary and multiclass classifiers is calculated. This number is then compared with the parameters required by other generic CNN-based deep neural networks [
Data used | Model trained | No. of parameters | F1-score (%) |
---|---|---|---|
(NTU) Skeletal dataset | SqueezeNet | 747633 | 65.3 |
(NTU) Skeletal dataset | Inception V3 | 24481346 | 75.18 |
(NTU) Skeletal dataset | DenseNet169 | 12566065 | 77.63 |
(NTU) Skeletal dataset | ResNet34 | 21309809 | 77.77 |
(NTU) Skeletal dataset | ResNet152 | 58244209 | 72.54 |
(NTU) Skeletal dataset | VGG13 | 129151601 | 72.85 |
(NTU) Skeletal dataset | VGG19 | 139770993 | 72.33 |
(NTU) Skeletal medical classes | Medical identifier (LSTM) | 509009 | 73.4 |
(NTU) Skeletal dataset | Medical detector (LSTM) | 180002 | 77.4 |
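To make the complexity comparison concrete, the parameter counts in the table above can be divided by the parameter count of the medical identifier. The snippet only restates numbers from the table; the variable names are illustrative.

```python
# Parameter counts copied from the comparison table.
cnn_parameters = {
    "SqueezeNet": 747633,
    "Inception V3": 24481346,
    "DenseNet169": 12566065,
    "ResNet34": 21309809,
    "ResNet152": 58244209,
    "VGG13": 129151601,
    "VGG19": 139770993,
}
identifier_parameters = 509009  # medical identifier (LSTM)

# How many times larger each CNN is than the LSTM identifier.
ratios = {
    name: count / identifier_parameters
    for name, count in cnn_parameters.items()
}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.1f}x")
```

On these figures the ratio ranges from about 1.5 for SqueezeNet to about 275 for VGG19, with most of the CNN models falling between roughly 25 and 115 times the size of the identifier.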
Classifier name | Data used | Cross subject accuracy F1-score (%) |
---|---|---|
ST-LSTM [ | 3D Skeleton | 69.2 |
ST-GCN [ | 3D Skeleton | 72.4 |
GCA-LSTM [ | 3D Skeleton | 74.4 |
Medical detector (ours) | 2D Skeleton | 79.9 |
Medical identifier (ours) | 2D Skeleton | 73 |
From
Classifier name | Data used | F1-score (%) |
---|---|---|
Shojaei-Hashemi’s classifier [ | 3D Skeleton | 93.4 |
Yin’s LSTM classifier [ | 3D Skeleton | 98.6 |
Fall detection identifier (ours) | 2D Skeleton | 95.2 |
Thus, we present a working procedure to detect and identify visual medical conditions in a non-invasive manner and have tested its accuracy on the standard NTU RGB + D 2D skeletal dataset. Our approach is highly scalable because it uses common RGB data, which can be obtained from traditional surveillance cameras. Our system, tested on NTU RGB + D 2D skeletal data, produced average F1 scores of 77% for medical condition detection and 73% for medical condition identification. The developed system shows high accuracy in identifying differentiable medical conditions, moderate accuracy with difficult-to-discern actions and high interpretability. The number of parameters in the medical condition identification classifier is lower by a factor of 10 to 100 compared to other deep learning classifiers on 2D skeletal data with comparable accuracy. This suggests that our system is computationally efficient and can be implemented on commodity hardware.
Due to its high scalability and non-invasive nature, the system could be used to monitor medical conditions such as cough and headache, which could be representative of highly infectious diseases during a pandemic. For real-world usage, the system accuracy needs to be improved further, especially for the difficult-to-discern medical conditions. With this goal, we plan to explore more granular features related to medical conditions and possibly augment skeletal data with representative features derived from RGB images and depth-based data for improved accuracy.
The authors received no specific funding for this study.
The authors declare that they have no conflicts of interest to report regarding the present study.