Computers, Materials & Continua
A Hybrid Duo-Deep Learning and Best Features Based Framework for Action Recognition
1National University of Sciences and Technology (NUST), Islamabad, 46000, Pakistan
2Department of Computer Science, HITEC University, Taxila, Pakistan
3College of Computer Engineering and Science, Prince Sattam Bin Abdulaziz University, Al-Kharaj 11942, Saudi Arabia
*Corresponding Author: Muhammad Naeem Akbar. Email: email@example.com
Received: 15 February 2022; Accepted: 06 April 2022
Abstract: Human Action Recognition (HAR) is an active research topic in computer vision, driven by an important application known as video surveillance. Researchers have introduced various intelligent methods based on deep learning and machine learning, but these methods still face challenges such as similarity among actions and redundant features. In this paper, we propose a framework for accurate HAR based on deep learning and an improved feature optimization algorithm. The framework comprises several critical steps, from deep learning feature extraction to feature classification. The original video frames are normalized before training the fine-tuned deep learning models, MobileNet-V2 and Darknet53. The pre-trained deep models are used for feature extraction, and the extracted features are fused using the canonical correlation approach. An improved particle swarm optimization (IPSO)-based algorithm then selects the best features, which are finally classified into actions using various classifiers. The experimental process was performed on six publicly available datasets, KTH, UT-Interaction, UCF Sports, Hollywood, IXMAS, and UCF YouTube, attaining accuracies of 98.3%, 98.9%, 99.8%, 99.6%, 98.6%, and 100%, respectively. Compared with existing techniques, the proposed framework achieves improved accuracy.
Keywords: Action recognition; deep learning; features fusion; features selection; recognition
Human Action Recognition (HAR) was formally founded by Leonardo da Vinci (1452–1519), who was interested in human appearance and motion analysis in order to teach students how to draw people correctly. At present, much research focuses on HAR, which has a wide range of applications such as TV production, entertainment, education, social studies, security, intelligent video surveillance, home monitoring, human–machine interfacing, video storage and retrieval, assistive living, and assistant robots. It covers a wide range of research topics, including human detection in video, human pose estimation, human tracking, and the analysis and comprehension of human activities. In the last 10–15 years, there has been significant progress in HAR research, which has resulted in commercial products.
Action recognition may cover a large variety of human actions depending upon the requirements of various applications. In security or surveillance applications, running, jumping, pushing, and punching are important, while sitting and mobile calling also receive much attention. Still-image-based HAR identifies the action of a person from a single image without considering temporal information. Action representation and analysis-based HAR involves feature representation using feature extraction and machine learning techniques. Abnormal activity detection is used in video surveillance to prevent crime or to inspect crime scenes. Action classification is a pattern recognition and machine learning problem, and many techniques have been introduced in the literature. Well-known techniques include graph-based methods, SVM, nearest neighbor, HMM, and ELM, to name a few. Graph-based methods used to classify input features for HAR include Random Forest (RF) and Geodesic Distance Isograph (GDI), among others. The concept of the Support Vector Machine (SVM) is to separate the data points using a hyperplane. Non-linear data is classified by performing multi-class learning using a one-vs.-one SVM classifier with a polynomial kernel. These techniques give better results for smaller datasets, but for larger datasets, accuracy drops and computational time rises.
Deep Learning (DL) is a technique that instructs computers to perform tasks similar to those naturally carried out by the human brain. Convolutional Neural Networks (CNN), RNN, Long Short-Term Memory (LSTM), Deep Belief Networks (DBN), and Generative Adversarial Networks (GAN) are widely used networks for the action recognition task. In a CNN, feature maps are created using the local neighborhood information of each image, from which the deep features are extracted. Convolution, activation, pooling, fully connected, and output layers are all important layers of the CNN architecture. Features are typically extracted from the fully connected layers, which produce a 1D vector. When the extracted features are insufficient for classification, a few researchers have used fusion techniques. According to the analysis of recent studies, some redundant features are also added during the fusion process; thus, researchers have applied feature reduction and selection techniques, so that only important features are chosen for the final classification.
Computer vision researchers have introduced several techniques for human action recognition (HAR) over the last few years, focusing on both classical and deep learning-based techniques. The classical techniques are based on region-of-interest detection, traditional feature extraction such as shape, texture, and point features, feature reduction, and classification through machine learning methods. Deep learning has shown much improvement in recent years across several applications, and HAR is one of the most prominent among them. Through deep learning, researchers have employed deep features from the dense layers and further optimized this information through feature selection techniques. A few researchers employed skeleton-based information to capture the movement of humans in video frames and then trained deep learning networks on it. Shen et al. presented a 3D skeleton-based framework for HAR in which features are extracted from skeletons and used to train an LSTM model. Skeleton points from the time series data are then composed in the network, yielding a complex LSTM network. The experimental process was conducted on three publicly available datasets and achieved improved recognition accuracy. Xie et al. presented a temporal CNN-based architecture for HAR. A CNN is adopted at the first step to encode the input; after that, a novel cross-spatial temporal graph CNN is introduced to obtain the joint information. The modeling capability is then enhanced by employing a temporal attention layer. Three datasets were employed for the experimental process, indicating improved accuracy. Zhang et al. suggested a new approach for action recognition that fuses CNN and BLSTM networks. In this method, a swarm intelligence-based algorithm is also introduced for finding the optimal hyperparameters of the deep neural networks.
They tested their approach using the UCF101, UCF50, and KTH datasets, showing improved recognition results. Nazir et al. introduced the dynamic Spatio-temporal Bag of Expression (DSTBoE) technique for HAR to overcome the problems of occlusion, interclass variations, and view independence noticed in realistic scenes. In this technique, SVM is used for classification. They applied their approach to the UCF50, UCF11, KTH, and UCF Sports datasets and obtained accuracies of 94.10%, 96.94%, 99.21%, and 98.60%, respectively. Rahimi et al. presented a kernelized Grassmann manifold learning-based method for HAR to overcome issues related to outliers and noise in the training data. They evaluated their approach on the UCF101, UTD-MHAD, MSR Action 3D, UCF Sports, and KTH datasets and attained higher accuracy on all of them.
Kumar et al. introduced a novel gated RNN method for action recognition, evaluated on the UCF Sports dataset with an accuracy of 96.8%. Kiran et al. suggested a novel deep learning approach for action classification. They used a pre-trained deep model named ResNet50; deep features are extracted from the global average pooling and FC layers and then fused. Finally, the fused feature vector is classified by a classifier. They used the KTH, UT-Interaction, UCF YouTube, UCF Sports, and IXMAS datasets and attained better accuracy. Li et al. suggested a new residual network for HAR based on feature fusion and global average pooling (GAP). They tested their technique on the CAVIAR, UCF101, UCF11, and UT-Interaction datasets and attained accuracy above 90%. Ahmed et al. suggested a novel pose descriptor for action recognition, tested on the HCA, CASIA, UCF11, and UT-Interaction datasets with accuracies of 88.72%, 98%, 99%, and 96.4%, respectively.
The preceding methods concentrated on HAR's temporal information and improved accuracy on several publicly available datasets. A few researchers concentrated on the fusion process; the combination of multi-level features improves recognition accuracy but increases computational time. The major challenges researchers still face are: (i) extracting the most relevant features using deep learning networks; (ii) redundant and irrelevant features, which not only reduce accuracy but also increase computational time. In this paper, we propose a new framework for HAR based on deep learning and a hybrid PSO algorithm. Our most significant contributions are as follows:
• Fine-tuned the MobileNet-V2 and DarkNet-53 pre-trained deep learning models on action recognition datasets using transfer learning. The weights of 50% of the layers were frozen, rather than freezing all layers except the new FC layer.
• Features are extracted from middle layers (average pooling and convolution) and fused by employing the canonical correlation analysis (CCA) approach. The CCA technique is refined by a single threshold function, called the non-redundant function.
• A hybrid particle swarm optimization algorithm is adopted. The proposed hybrid algorithm applies crow search optimization to the PSO output features.
The rest of the article is organized as follows: the proposed methodology for HAR is presented in Section 2, including the convolutional neural network (CNN) framework, deep learning feature extraction, feature fusion, and selection of the best features using the hybrid PSO algorithm. Section 3 presents the experimental results and discussion of the proposed framework. Finally, Section 4 concludes this article.
2 Proposed Methodology
The proposed method encompasses feature extraction, followed by feature fusion and selection of optimized features, which are used for classification by various supervised learning-based classifiers to identify human actions. Fig. 1 illustrates the proposed HAR framework. As shown in this figure, the original video frames are first normalized and used to train the fine-tuned deep learning models, MobileNet-V2 and Darknet53. The pre-trained deep models are utilized for feature extraction, and the extracted features are fused using the canonical correlation approach. After that, an improved PSO-based optimization algorithm is adopted for best feature selection. The selected features are subsequently utilized for classification of actions through different classifiers.
2.1 Datasets Description and Normalization
In this work, we utilized six publicly available datasets: KTH, UT-Interaction, UCF Sports, Hollywood, IXMAS, and UCF YouTube. The KTH human activities dataset contains six activity classes: walking, running, jogging, boxing, waving, and clapping. The UT-Interaction dataset consists of six action classes, namely pushing, pointing, kicking, punching, handshaking, and hugging. The UCF Sports dataset contains 13 action classes, such as diving side, skateboarding front, and run side, with a total of 182 video sequences. The Hollywood dataset consists of 8 action classes, with more than 600 video sequences. The UCF YouTube action dataset consists of 10 action classes, such as basketball, walking, tennis swing, biking, swing, and golf swing, with more than 1600 video sequences. The video sequences of each dataset are converted into frames and resized to a fixed dimension. A few sample video frames are illustrated in Fig. 2.
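As an illustrative sketch of the normalization step (not the exact preprocessing used in the paper), each extracted frame can be min-max normalized to the [0, 1] range; the frame shapes and value ranges below are assumptions:

```python
import numpy as np

def normalize_frames(frames):
    """Min-max normalize each video frame to the [0, 1] range.

    frames: array of shape (num_frames, height, width, channels)
    with raw pixel intensities (e.g., uint8 in [0, 255]).
    """
    frames = frames.astype(np.float64)
    # Normalize each frame independently by its own intensity range.
    mins = frames.min(axis=(1, 2, 3), keepdims=True)
    maxs = frames.max(axis=(1, 2, 3), keepdims=True)
    return (frames - mins) / np.maximum(maxs - mins, 1e-12)

# Example: two tiny synthetic single-channel frames
clip = np.array([[[[0], [255]], [[128], [64]]],
                 [[[10], [20]], [[30], [40]]]], dtype=np.uint8)
norm = normalize_frames(clip)
```

In practice, the resizing step would use a standard interpolation routine before the frames are fed to the deep models.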
2.2 Convolutional Neural Network
A convolutional neural network (CNN) is a powerful deep neural network for object recognition and image classification. All neurons in a CNN are connected in a feed-forward fashion to the next layer's neurons. A CNN model consists of several layers, such as convolution, pooling, ReLU, and fully connected layers.
The convolution layer is responsible for recognizing and extracting local features from the input image. Consider a set of training images, where each input image is assigned to one of the output classes. In the convolution layer, a kernel slides over the input image, and local features are extracted by the following relation:
where the convolution generates a feature map for each layer from that layer's trainable parameters, passed through an activation function.
A pooling layer, a non-linear down-sampling approach, is also used in a CNN. It effectively integrates two essential ideas: max pooling and convolution. Max pooling aggregates a collection of maximal responses for feature reduction as well as resistance to noise and volatility. The following equation describes the max-pooling configuration:
where the weights are real-valued. A fully connected feed-forward layer follows the convolution and pooling layers. It works on the same principles as a conventional fully connected feed-forward network, with a given set of inputs and outputs:
where the output of the last connected layer is a 1D vector obtained from the weight matrix.
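The convolution, activation, and pooling operations described above can be sketched in a few lines of NumPy. This toy forward pass is an illustration only; the image size, averaging kernel, and single channel are assumptions, not the paper's configuration:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """Element-wise activation function."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy forward pass: 6x6 input, 3x3 kernel -> 4x4 map -> 2x2 after pooling
img = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0           # simple averaging kernel
fmap = relu(conv2d(img, kernel))         # convolution + activation
pooled = max_pool(fmap)                  # max pooling
features = pooled.flatten()              # fully connected input: 1-D vector
```

The flattened `features` vector plays the role of the 1D input consumed by the fully connected layer.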
2.3 Transfer Learning
Transfer learning (TL) is the process of reusing a pre-trained model for another task. The main purpose of TL is to reuse a CNN when only a small amount of training data is available. Mathematically, TL is defined over a source domain with a specific learning task and a target domain with its own learning task, each with training data and corresponding labels. TL's major role is to enhance the learning of the target function by leveraging information from both the source and the target. The process of TL is presented visually in Fig. 3.
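A minimal sketch of the freezing idea behind fine-tuning with TL, using a toy list of weight matrices rather than a real CNN; the layer count, learning rate, and gradients are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pre-trained" network: a list of weight matrices, one per layer.
weights = [rng.standard_normal((4, 4)) for _ in range(6)]

# Freeze the first 50% of the layers; only the rest are updated.
n_frozen = len(weights) // 2
snapshot = [w.copy() for w in weights]        # kept only for comparison

def sgd_step(weights, grads, lr=0.01):
    """One SGD update on the target task that skips the frozen layers."""
    for i in range(n_frozen, len(weights)):
        weights[i] -= lr * grads[i]

# Pretend gradients from one training batch on the target task.
grads = [np.ones((4, 4)) for _ in weights]
sgd_step(weights, grads)
```

After the step, the frozen layers retain their source-domain weights while the remaining layers adapt to the target task.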
2.4 Deep Learning Features
In this work, we utilized two pre-trained deep learning models, MobileNet-V2 and DarkNet-53, for deep feature extraction. MobileNet-V2 is a lightweight CNN model with 5.2 million parameters. The network is 53 layers deep and accepts a fixed-dimension input. It includes inverted residual blocks that project a high-dimensional input into a low-dimensional output, together with convolutional layers, bottleneck layers, and a fully connected layer. Originally, this network was trained on the ImageNet dataset with 1000 object classes. In this work, we fine-tuned this model by removing the FC layer; a new layer was then added and trained on the action recognition datasets. The training is performed through TL, as described in Fig. 3 and Section 2.3. Features are extracted from the global average pooling layer of the fine-tuned model, yielding a feature vector. DarkNet-53 is lightweight and performs much better than the DarkNet-19 and ResNet-101 deep models. This network also accepts a fixed-dimension input image and has 41.6 million parameters in total; it, too, was originally trained on the ImageNet dataset with 1000 object classes. In the fine-tuning process, we removed the last layer, named Conv53, and added a new layer named New_conv53. The entire model is then connected and trained through TL on the action datasets. Features are extracted from the average pooling layer of the trained deep model. After that, the extracted features are fused using a canonical correlation-based approach.
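Global average pooling, from which the feature vectors are extracted, simply averages each channel's spatial map; a minimal sketch (the 7 × 7 × 1280 activation shape is an assumption based on typical MobileNet-V2 outputs, not a value stated in the paper):

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse each channel's spatial map to a single value.

    feature_maps: (height, width, channels) activation from the last
    convolutional stage; returns a 1-D feature vector of length channels.
    """
    return feature_maps.mean(axis=(0, 1))

# Toy activation: 7x7 spatial grid with 1280 channels
acts = np.random.default_rng(1).random((7, 7, 1280))
feat = global_average_pool(acts)
```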
2.5 Features Fusion
In this work, we adopted canonical correlation analysis (CCA) for deep feature fusion. Consider two feature vectors α and β extracted from the two deep models. In this paper, both α and β are assumed to be centered and scaled to unit norm. CCA finds a pair of linear transformations, one for each set of variables, such that when the sets of variables are transformed, the corresponding coordinates are maximally correlated. Mathematically, the method computes two projection vectors such that the canonical correlation is maximized:
As the correlation coefficient is invariant to scaling of the projection vectors, Eq. (4) can be written as:
This problem can be solved using the Lagrange multiplier method, which leads to the following generalized eigenvalue problem:
If the covariance matrix is non-singular, the above problem is equivalent to:
Here, the resulting projection vectors are called the canonical vectors. Based on the above formulations (Eqs. (4)–(9)), the final vectors are obtained by the following equation:
Applying Eq. (10) yields the resultant fused vector. As observed during the experimental process, this vector contains several redundant features; therefore, we adopted an improved PSO-based algorithm for best feature selection.
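A compact NumPy sketch of CCA-based fusion, assuming centered feature matrices and fusion by summation of the canonical projections; the sample counts, dimensions, and regularization constant are illustrative assumptions:

```python
import numpy as np

def inv_sqrt(m, eps=1e-8):
    """Inverse square root of a symmetric positive (semi-)definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def cca_fuse(X, Y, k):
    """Fuse two centered feature sets by summing their canonical projections.

    X: (n_samples, p), Y: (n_samples, q); returns (n_samples, k).
    """
    n = X.shape[0]
    Sxx, Syy = X.T @ X / n, Y.T @ Y / n
    Sxy = X.T @ Y / n
    # Whitened cross-covariance; its singular vectors give the CCA directions.
    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(K)
    Wx = inv_sqrt(Sxx) @ U[:, :k]
    Wy = inv_sqrt(Syy) @ Vt.T[:, :k]
    return X @ Wx + Y @ Wy            # fusion by summation of projections

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 8)); X -= X.mean(axis=0)
Y = rng.standard_normal((200, 6)); Y -= Y.mean(axis=0)
fused = cca_fuse(X, Y, k=4)
```

Concatenating the two projections instead of summing them is an equally common fusion choice.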
2.6 Features Selection
In this work, we adopted an improved feature selection algorithm named Particle Swarm with Crow Search Optimization (PSO-CSA). The algorithm works in three steps. In the first step, features are selected through a PSO-based algorithm. In the second step, the resultant feature vector is passed to CSA for further refinement. In the third step, the best selected features are further refined using an entropy-based function for the final selection. Fine-KNN is utilized as the fitness function, and the error is computed as the loss of the selected features. The final selected features are then classified using machine learning classifiers. The working of the algorithm is defined as follows.
Consider particle positions and velocities in a multi-dimensional search space, initialized randomly, each offering a candidate solution in the PSO algorithm. In each iteration, the solution of a particle is as indicated in Eq. (11). The current particle solution is then updated for the local and global domains, which are computed using Eqs. (12) and (13), respectively.
The cognitive and social factors, together with random values drawn from [0,1], scale the update terms. The inertia weight describes how the particle's previous velocity influences the velocity of the subsequent iteration; its value is determined by Eq. (13).
The best attained features are represented by a feature vector, which is then passed to CSA for further refinement of the selected features.
Consider the PSO output vector as the input population, where the population contains a number of solutions (the number of crows). The position of each crow at each iteration is described by a vector giving that crow's probable position in each dimension. If a crow i wants to take food from another crow j, one of two things can happen: (i) crow j does not notice that crow i is tracking her, so crow i finds crow j's food storage and updates its position according to Eq. (15),
where the flight length and a random number drawn from [0,1] determine the step size. (ii) Crow j senses that crow i is following her to find her food; in this case, crow j travels at random to deceive crow i. The two situations can be combined mathematically as in Eq. (16):
where the awareness probability of each crow at each iteration, together with random numbers drawn from [0,1], governs the update. The value of the flight length affects a crow's search ability: high values make a significant contribution to global search, while low values help with local search. During the algorithm's execution, each crow is evaluated using a well-defined fitness function (Fine-KNN). The crows then change their positions based on the fitness score, and each new position is checked for feasibility. According to Eq. (17), the crows' memories are updated as follows:
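The two CSA cases can be sketched as follows. The toy quadratic objective stands in for the Fine-KNN fitness, and the awareness probability, flight length, and population size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 10, 8                 # number of crows, dimension of each position
AP, fl = 0.1, 2.0            # awareness probability, flight length

pos = rng.random((n, d))     # crow positions (candidate feature weights)
mem = pos.copy()             # each crow's memory of its best hiding place

def fitness(x):
    """Toy objective standing in for the Fine-KNN fitness in the paper."""
    return -np.sum((x - 0.3) ** 2)

for _ in range(50):
    for i in range(n):
        j = rng.integers(n)              # crow i follows a random crow j
        if rng.random() >= AP:
            # Case (i): crow j is unaware; crow i moves toward j's memory.
            new = pos[i] + rng.random() * fl * (mem[j] - pos[i])
        else:
            # Case (ii): crow j is aware; crow i moves to a random position.
            new = rng.random(d)
        new = np.clip(new, 0.0, 1.0)     # feasibility check
        pos[i] = new
        if fitness(new) > fitness(mem[i]):
            mem[i] = new                 # memory update, as in Eq. (17)

best = mem[np.argmax([fitness(m) for m in mem])]
```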
The resultant features are passed to an entropy function and sorted in descending order. From the sorted vector, the top 90% of features are selected for the final classification. In the classification phase, several classifiers are employed, as mentioned in the Results section.
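The entropy-based final selection can be sketched as follows, assuming per-feature Shannon entropy computed from a histogram; the bin count and synthetic data are illustrative assumptions:

```python
import numpy as np

def entropy_select(features, keep=0.9, bins=16):
    """Rank feature columns by Shannon entropy and keep the top `keep` share.

    features: (n_samples, n_features) fused feature matrix.
    Returns the retained column indices, highest-entropy first.
    """
    scores = np.empty(features.shape[1])
    for j in range(features.shape[1]):
        hist, _ = np.histogram(features[:, j], bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        scores[j] = -(p * np.log2(p)).sum()      # Shannon entropy
    order = np.argsort(scores)[::-1]             # descending order
    k = int(np.ceil(keep * features.shape[1]))
    return order[:k]

X = np.random.default_rng(5).standard_normal((100, 50))
idx = entropy_select(X, keep=0.9)
```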
3 Results and Discussion
The results of the proposed HAR framework are presented in this section in terms of numerical values, time plots, and confusion matrices. Six publicly available datasets are employed for the experimental process: KTH, UT-Interaction, UCF Sports, Hollywood, IXMAS, and UCF YouTube. During the training of the pre-trained models, several hyperparameters are used: a learning rate of 0.005, 100 epochs, a dropout factor of 0.5, 30 iterations per epoch, and the stochastic gradient descent (SGD) optimizer. A 50:50 train-test split is adopted, and K-fold cross-validation is utilized with K = 10. The performance on each dataset is computed using several measures, such as accuracy, time, and recall rate. The classification accuracy is computed using several machine learning classifiers, such as LDA, SVM, KNN, and Bagged Tree. All framework simulations are conducted in MATLAB 2021a on a desktop computer with 16 GB of RAM and an 8 GB graphics card.
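The accuracy and recall measures reported below can be computed directly from a confusion matrix; a minimal sketch with toy labels:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Fraction of correctly predicted samples (diagonal over total)."""
    return np.trace(cm) / cm.sum()

def mean_recall(cm):
    """Average per-class recall: diagonal entry over each row sum."""
    return np.mean(np.diag(cm) / cm.sum(axis=1))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
cm = confusion_matrix(y_true, y_pred, 3)
```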
KTH Results: Tab. 1 presents the numerical results of the proposed HAR framework on the KTH dataset. In this table, the Cubic SVM classifier attained the maximum accuracy of 98.30% in 240.89 s; its recall rate is 0.98. The performance of this classifier can be further verified through the confusion matrix illustrated in Fig. 4, where the correctly predicted values are given along the diagonal. The accuracies of the remaining classifiers are also given in this table, showing an average accuracy above 90%. Moreover, the computational time of each classifier is plotted in Fig. 5. The LDA classifier's execution time (14.62 s) is the minimum among all classifiers, whereas the MGSVM classifier consumed the most time, 307.42 s. Overall, the Cubic SVM classifier shows the best recognition performance.
UT-Interaction Results: Tab. 2 presents the numerical results of the proposed HAR framework on the UT-Interaction dataset. In this table, the Fine KNN classifier attained the maximum accuracy of 98.90% in 21.22 s; its recall rate is 0.99. The performance of this classifier can be further verified through the confusion matrix illustrated in Fig. 6, where the correctly predicted values are given along the diagonal. The accuracies of the remaining classifiers are also given in this table, showing an average accuracy above 88%. Moreover, the computational time of each classifier is plotted in Fig. 7. The LDA classifier's execution time (17.25 s) is the minimum among all classifiers, whereas the Bagged Tree classifier consumed the most time, 58.34 s. Overall, the Fine KNN classifier shows the best recognition performance.
UCF Sports Results: Tab. 3 presents the numerical results of the proposed HAR framework on the UCF Sports dataset. In this table, the LDA classifier attained the maximum accuracy of 99.80% in 27.23 s; its recall rate is 1.00. The performance of this classifier can be further verified through the confusion matrix illustrated in Fig. 8, where the correctly predicted values are given along the diagonal. The accuracies of the remaining classifiers are also given in this table, showing an average accuracy above 90%. Moreover, the computational time of each classifier is plotted in Fig. 9. The LDA classifier's execution time (27.23 s) is the minimum among all classifiers, whereas the MGSVM classifier consumed the most time, 516.16 s. Overall, the LDA classifier shows the best recognition performance.
Hollywood Results: Tab. 4 presents the numerical results of the proposed HAR framework on the Hollywood dataset. In this table, the LDA classifier attained the maximum accuracy of 99.60% in 30.93 s; its recall rate is 0.99. The performance of this classifier can be further verified through the confusion matrix illustrated in Fig. 10, where the correctly predicted values are given along the diagonal. The accuracies of the remaining classifiers are also given in this table, showing an average accuracy above 90%. Moreover, the computational time of each classifier is plotted in Fig. 11. The LDA classifier's execution time (30.93 s) is the minimum among all classifiers, whereas the Cubic SVM classifier consumed the most time, 730.73 s. Overall, the LDA classifier shows the best recognition performance.
IXMAS Results: Tab. 5 presents the numerical results of the proposed HAR framework on the IXMAS dataset. In this table, the Cubic SVM classifier attained the maximum accuracy of 98.60% in 585.83 s; its recall rate is 0.99. The performance of this classifier can be further verified through the confusion matrix illustrated in Fig. 12, where the correctly predicted values are given along the diagonal. The accuracies of the remaining classifiers are also given in this table, showing an average accuracy above 86%. Moreover, the computational time of each classifier is plotted in Fig. 13. The LDA classifier's execution time (28.00 s) is the minimum among all classifiers, whereas the MGSVM classifier consumed the most time, 767.46 s. Overall, the Cubic SVM classifier shows the best recognition performance.
UCF YouTube Results: Tab. 6 presents the numerical results of the proposed HAR framework on the UCF YouTube dataset. In this table, the Cubic SVM classifier attained the maximum accuracy of 100% in 225.83 s; its recall rate is 1.00. The performance of this classifier can be further verified through the confusion matrix illustrated in Fig. 14, where the correctly predicted values are given along the diagonal. The accuracies of the remaining classifiers are also given in this table, showing an average accuracy above 95%. Moreover, the computational time of each classifier is plotted in Fig. 15. The LDA classifier's execution time (28.00 s) is the minimum among all classifiers, whereas the Quadratic SVM classifier consumed the most time, 474.46 s. Overall, the Cubic SVM classifier shows the best recognition performance.
3.2 Discussion and Comparison
Fig. 1 presents the detailed architecture of HAR using deep learning and the optimization method. Experiments were carried out on six different datasets. For each dataset, ten different classifiers were applied to assess performance in detail, and the results are displayed in Tabs. 1–6, which include the accuracy and execution time of all classifiers. For the UCF Sports and Hollywood datasets, the LDA classifier turns out to be the best-performing option, with accuracy above 99% alongside the fastest execution time (less than 31 s). For three datasets (KTH, IXMAS, and UCF YouTube), Cubic SVM showed the best accuracy, remaining above 98%; however, the execution time was best with the LDA classifier (less than 28 s). KNN was the best-performing classifier for only one dataset (UT-Interaction), where an accuracy of 98.9% was achieved, while the execution time again remained lowest for the LDA classifier (14.62 s). Analyzing the overall performance reveals that LDA is the best-performing option across all datasets, primarily in terms of execution time and accuracy, whereas Cubic SVM proves to be the second-best choice, with good accuracy but a longer execution time. Finally, a comparison with state-of-the-art (SOTA) techniques is given in Tab. 7.
4 Conclusion
Human action recognition has been an active research topic in recent years, owing to recent advances in the fields of machine learning and deep learning. Computer vision researchers have introduced a number of techniques, with a focus on both classical and deep learning-based techniques. Due to similar actions and a large number of video sequences, traditional techniques did not perform well. We proposed a new framework in this paper that is based on the fusion of deep learning features and an improved PSO-based algorithm. The proposed framework was tested on six action datasets and found to be more accurate. Based on the results, we concluded that the fusion framework achieved higher accuracy, but the process lengthens the computational time. The long computational time is alleviated further by an improved PSO algorithm. The trapezoidal rule-based PSO algorithm will be modified and used for feature selection in the future. Furthermore, recent deep learning, LSTM, and reinforcement learning techniques will be considered for HAR in the future [55–57].
Acknowledgement: We are thankful to the National University of Sciences and Technology (NUST) for overall support.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|