|Computers, Materials & Continua |
Video Analytics Framework for Human Action Recognition
1Department of Computer Science, HITEC University Taxila, Taxila, 47080, Pakistan
2College of Computer Science and Engineering, University of Ha’il, Ha’il, Saudi Arabia
3Department of Electrical Engineering, College of Engineering, Jouf University, Sakaka, Saudi Arabia
4College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Khraj, Saudi Arabia
5Department of Computer Science and Engineering, Soonchunhyang University, Asan, Korea
6Department of Computer Science, COMSATS University Islamabad, Wah Campus, 47040, Pakistan
*Corresponding Author: Yunyoung Nam. Email: firstname.lastname@example.org
Received: 14 January 2021; Accepted: 19 February 2021
Abstract: Human action recognition (HAR) is an essential but challenging task for observing human movements. The problem involves observing variations in human movement and identifying activities with machine learning algorithms. This article addresses the challenges of activity recognition by implementing and evaluating an intelligent segmentation, feature reduction, and feature selection framework. A novel approach is introduced for the fusion of segmented frames, and multi-level features of interest are extracted. An entropy-skewness based feature reduction technique is implemented, and the reduced features are converted into a codebook by serial-based fusion. A custom-made genetic algorithm is applied to the constructed feature codebook in order to select the strong, prominent features. The selected features are then used by a multi-class SVM for action identification. Comprehensive experiments are carried out on four action datasets, namely Weizmann, KTH, Muhavi, and WVU multi-view, on which we achieve recognition rates of 96.80%, 100%, 100%, and 100%, respectively. The analysis reveals that the proposed action recognition approach is efficient and more accurate than existing approaches.
Keywords: Video analytics; action recognition; features classification; entropy; data analytic
1 Introduction
Action recognition based on human movements has drawn considerable interest due to its emerging applications in video analytics [1,2]. An emerging trend is video labeling of actions in sports such as football, swimming, and paragliding, and even of typical daily-life movements, for example in forensic analysis; this requires recognition at certain levels of abstraction. Interactive applications already involve human-computer interaction, on which a substantial amount of work has been done covering a broad range of topics. In the literature, most works address very specific problems in action recognition, namely human-body movement, facial expression, image labeling, and perception of human-object interaction. Some authors have also focused on introducing feature selection algorithms for distance-based similarity measures and SVM. Many techniques have recently been introduced for HAR, which may be categorized as graph based, trajectory based, codebook based, and feature extraction based, to name a few. Wu et al. presented a HAR method with graph-based visual saliency and space-time nearest points. Gao et al. introduced a hypergraph-based method to compute the distance between two objects in a multi-view scenario. In these methods, the vertices and edges of the objects are defined in a cluster view; edges join multiple points, and a weight is assigned to each edge based on the relationship between any two views in the group. Yi et al. introduced trajectory-based HAR, which addresses the problem of motion information between distinct motion regions. The method makes use of trajectory-based covariance features and performs better than the histogram of oriented gradients and its variants. Althloothi et al. presented a HAR technique based on motion and shape features extracted using spherical harmonics.
Another work introduced a new feature referred to as the local surface geometric feature (LSGF). LSGF features for human body posture and expression are extracted and then used in the covariance matrix and feature vectorization. Chen et al. presented depth motion maps (DMM). This method consists of four major steps: depth map generation, feature extraction using DMM, feature reduction, and recognition. PCA is used for dimensionality reduction and improves recognition efficiency. Other methods include a 16-layer CNN, fusion of features, a weighted segmentation based approach, and fusion of deep and handcrafted features, to name a few.
However, most recent contributions based on feature selection have not addressed frame enhancement, which we believe is a crucial step in making the foreground object more visible. For instance, a previously proposed optical flow algorithm fails to segment the foreground object in low-resolution videos and under variation in motion speed. Similarly, various feature selection and extraction techniques do not consider optimization of the local and global features, which usually leads to lower classification accuracy. We believe that a sound feature enhancement technique coupled with an efficient feature optimization mechanism would result in increased classification accuracy. To achieve greater classification accuracy, a novel framework is proposed that implements segmented frame fusion and Entropy-Skewness (ES) based feature reduction. In what follows, we enumerate the primary contributions of the proposed work, in the order of our research methodology:
• Construction of an enhanced HSI color space, utilizing a hybrid color transformation technique, which incorporates refinement of RGB channels, bottom-hat filtering, and NTSC transformation.
• Implementation of a novel maximum-region-based segmentation technique in which pixel fusion is performed using the proposed saliency-mapped frame.
• Extraction of a hybrid set of features and their dimension reduction by using the entropy skewness control method.
• Construction of a feature codebook using serial-based feature fusion, followed by an implementation of a genetic algorithm for prominent feature selection.
Finally, an extensive experimentation and comparison has been performed between the proposed and existing methods by implementing two use cases.
2 Proposed Framework
The proposed architecture consists of five major steps: a) frame enhancement using a new series of steps; b) a maximum-region-based segmentation technique that integrates frame fusion with a novel saliency map; c) extraction of texture, local, and global features using SFTA, LBP, and HOG; d) a novel feature reduction technique based on the Entropy-Skewness (ES) control method, followed by serial-based feature fusion to construct the feature codebook; e) a custom-made genetic algorithm that selects the most optimal features prior to final classification with a multi-class SVM. Fig. 1 shows the details of the proposed method.
2.1 Frames Enhancement
Foreground visibility is a major issue in this area and is addressed in this section. Frame enhancement is an important preprocessing step for accurate segmentation of foreground objects because we are dealing with raw input video data. This data contains many distorted, noisy, and dull frames with weak edges and fused boundaries. Enhancing these frames yields better image quality and information, which is the main motivation behind our frame enhancement approach. In addition, in a few recent studies an optical flow algorithm failed to segment the foreground object in low-resolution videos. To handle this kind of problem, a new technique named (BHNT SC–S) is implemented, which incorporates two fundamental steps: bottom-hat filtering and color space transformation. The complete process is performed in parallel. First, a conventional RGB frame is enhanced with bottom-hat filtering, and the result is subsequently used in the segmentation phase.
The fusion relation between the bottom-hat filter and NTSC transformation frame is given as; Let represents an original RGB frame having dimensions . The bottom hat-filtering technique is implemented on to enhance the brightness of the foreground object and reduce the background contrast with respect to black pixels. The bottom-hat filtering technique effectively works on tiny objects on a scattered background as follows:
where represents the bottom-hat frame, St represents the structuring element, which is initialized as 9 and is the closing operation. Then NTSC transformation is performed by utilizing to make the foreground object more visible. The NTSC transformation is performed as follows:
where l = , and represents an index for three channels of red, green, and blue, respectively. , , are modified red, green, and blue channel, respectively. The green channel is utilized for the gaussian mixture model (GMM) segmentation.
Here, the transformation output is the NTSC frame. The enhanced NTSC frame is improved with a Gaussian function and is further used for the novel saliency segmentation. Finally, an HSI transformation is performed for maximal region segmentation, as shown in Fig. 1. Visibility was tested on each channel, and the saturation channel produced better results than the others; hence, we select the saturation channel as the input for maximal region segmentation. The results of the preprocessing step are shown in Fig. 2: the original frames are first processed in the green channel, followed by bottom-hat filtering and the NTSC transformation, after which the frames are reconstructed and a saturation frame is obtained for further processing.
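The two best-defined operations of this preprocessing stage can be sketched in a few lines. The following is a minimal, dependency-free illustration and not the authors' implementation: it assumes a flat square structuring element (the text gives only the size, 9), truncates the window at frame borders, and uses the standard HSI saturation formula S = 1 - 3·min(R, G, B)/(R + G + B). The names `bottom_hat` and `saturation` are hypothetical.

```python
def _morph(img, k, pick):
    """Apply `pick` (max or min) over a flat k x k window, truncated at borders."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[pick(img[y][x]
                  for y in range(max(0, i - r), min(h, i + r + 1))
                  for x in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]


def bottom_hat(img, k=9):
    """Bottom-hat: morphological closing (dilation then erosion) minus the input."""
    closed = _morph(_morph(img, k, max), k, min)
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(closed, img)]


def saturation(r, g, b):
    """Standard HSI saturation (assumed form): 1 - 3 * min(R, G, B) / (R + G + B)."""
    total = r + g + b
    return 0.0 if total == 0 else 1.0 - 3.0 * min(r, g, b) / total


frame = [[10] * 5 for _ in range(5)]
frame[2][2] = 2                      # one small dark spot on a bright background
enhanced = bottom_hat(frame, k=3)    # k=3 only because the toy frame is 5 x 5
```

On the toy frame, the bottom-hat response is non-zero only at the small dark spot, which is the behavior the text relies on for tiny objects on a scattered background.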
2.2 Frames Segmentation
In this section, we segment the foreground objects in order to identify their activities. An optical flow algorithm is used to identify motion regions in the frame. We then construct a novel saliency method, which is fused with a new maximal region segmentation technique. The optical flow algorithm is executed in parallel with the novel saliency method, as shown in Fig. 1. The purpose of frame fusion is to maximize accuracy and reduce the error rate.
Saliency map: The optical flow function takes three parameters, for the horizontal, vertical, and time directions (h, v, t), and operates on the 3-dimensional enhanced frame. It executes in parallel with the saliency method to give the motion information of a foreground object in the current frame. A chi-square distance function is applied to the resultant frame to calculate the distance between the motion pixels. The motion pixels with minimum distance are considered a salient object, and the pixels with maximum distance represent the background. The chi-square distance is calculated as follows:
where, T represents a selection of a salient object and the background. If , the chi-square distance between pixels is minimum which in turn labels it as a salient object, otherwise it is considered as background. Then color features of a salient object are extracted, which are effective for saliency estimation. RGB and LAB color spaces are used for features extraction and mean, range and entropy are calculated for each channel. The cumulative mean and standard deviation are calculated for the color frame. The mean value is used as a threshold value for frame binarization and the centered value of the frame is computed by as follows:
The center value is subtracted from the color image and a new mapped frame is obtained as follows:
We then apply an activation function to remove noise in the salient frame and make the object more visible.
where H denotes the number of neighbor pixels and is mean of the mapped frame. The noise removal function F(R) is performed on the mapped frame to get a new improved salient frame. The improved salient frame is defined as:
where, Im(sal) represents the improved salient frame. The graphical sample results are shown in Fig. 3.
Finally, we set a threshold function to obtain the binary image as follows:
where, denotes the cumulative mean value, which is computed from the color frame. Some morphological operations are performed to remove extra regions from the segmented frame. The saliency-based method is described in Algorithm 1. The results are shown in Fig. 4.
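Two building blocks of the saliency step can be illustrated as follows. This is a sketch under stated assumptions: the chi-square distance uses one common form, 0.5·Σ(a − b)²/(a + b), which may differ from the exact expression used in the paper, and the binarization marks pixels at or above the frame mean as foreground (the direction of the comparison is assumed). Function names are hypothetical.

```python
def chi_square_distance(a, b, eps=1e-12):
    """One common chi-square distance between two non-negative vectors."""
    return 0.5 * sum((x - y) ** 2 / (x + y + eps) for x, y in zip(a, b))


def binarize_by_mean(frame):
    """Threshold a frame at its mean value: 1 for pixels >= mean, else 0."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [[1 if p >= mean else 0 for p in row] for row in frame]


mapped = [[0, 10], [10, 0]]
binary = binarize_by_mean(mapped)
```

A small distance between motion pixels then corresponds to the salient object and a large distance to the background, as described in the text.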
Segmentation Based on Maximum Region Extraction: Maximum region extraction has two primary steps. First, a mask of the input saturation channel is generated; second, a threshold value is obtained automatically from the object magnitude for the most significant regions in the masked frame. Afterwards, a few morphological operators are used to remove unwanted regions.
Mask generation: In mask generation of the saturation frame , we create a Zero matrix of size and set condition up to 1 as follows:
where p and q denote the number of pixels in one frame. We then set a dynamic threshold value and store the pixel value of the extracted object in the . The threshold is set as follows:
where is the threshold frame, T1 is the threshold value, which is automatically selected depending on the object magnitude. Further closing and filling operations are performed to make the segmented image more accurate. Algorithm 2 explains the segmentation process based on the maximum region extraction. The sample results are shown in Fig. 6d.
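A rough sketch of the mask-and-threshold idea is given below. It makes two assumptions that are not confirmed by the text: the threshold `t1` is passed in as a fixed value rather than being derived automatically from the object magnitude, and "maximum region" is interpreted as the largest 4-connected foreground component. The names `threshold` and `largest_region` are hypothetical.

```python
from collections import deque


def threshold(frame, t1):
    """Binary mask: 1 where the saturation value exceeds t1, else 0."""
    return [[1 if p > t1 else 0 for p in row] for row in frame]


def largest_region(mask):
    """Keep only the largest 4-connected region of a binary mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                region, queue = [], deque([(i, j)])
                seen[i][j] = True
                while queue:                       # breadth-first flood fill
                    y, x = queue.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(region) > len(best):
                    best = region
    out = [[0] * w for _ in range(h)]              # zero matrix, then set region to 1
    for y, x in best:
        out[y][x] = 1
    return out


sat = [[0.9, 0.8, 0.1, 0.7],
       [0.9, 0.1, 0.1, 0.1],
       [0.1, 0.1, 0.1, 0.1]]
segmented = largest_region(threshold(sat, 0.5))
```

The isolated pixel in the top-right corner passes the threshold but is discarded because it does not belong to the largest region.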
Frames Fusion: Frame fusion is the process of combining the information of two frames into a single frame. The fused frame is more accurate than its parent frames and contains more comprehensive information than any single segmented frame. In this article, we implement a novel frame fusion technique based on similar pixel values, as shown in Fig. 5. The proposed fusion technique is simple but more effective than the approaches listed above. The fusion process follows the additive law of probability, which overcomes the problem of over-smoothing or over-sharpening and provides statistically balanced segmented frames. Moreover, the enhancement procedure strengthens weak edges, which leads to an appropriate segmentation with clear boundaries.
Let denotes the all pixel’s values of both segmented frames. Let s1 denotes the pixel values of saliency frame Fin(sal), s2 denotes the pixel values of maximal region segmentation frame , and s3 denotes the common points of fin and . The frames fusion is computed as follows:
where denotes the number of occurrences of frame pixels, denotes the common pixels between two segmented frames, c denotes the complement of segmented frames and is fused segmented frame. The detailed segmentation and fusion results are shown in Fig. 6, which demonstrates the value of the fusion method.
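The additive law that the fusion step refers to can be checked on two small binary masks. The sketch below assumes that fusion amounts to the pixel-wise union of the saliency mask and the maximal-region mask, so that |s1 ∪ s2| = |s1| + |s2| − |s1 ∩ s2|; the paper's exact fusion equation may differ, and the function names are hypothetical.

```python
def fuse(sal, mreg):
    """Pixel-wise union of two binary segmentation masks."""
    return [[1 if a or b else 0 for a, b in zip(ra, rb)]
            for ra, rb in zip(sal, mreg)]


def common(sal, mreg):
    """Pixel-wise intersection: pixels present in both masks."""
    return [[1 if a and b else 0 for a, b in zip(ra, rb)]
            for ra, rb in zip(sal, mreg)]


def count(mask):
    """Number of foreground pixels in a binary mask."""
    return sum(sum(row) for row in mask)


s1 = [[1, 1, 0], [0, 0, 0]]          # saliency-based segmentation
s2 = [[0, 1, 1], [0, 0, 1]]          # maximal-region segmentation
fused = fuse(s1, s2)
```

Here the fused mask has 4 foreground pixels, which equals 2 + 3 − 1, so pixels shared by both masks are counted once.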
2.3 Feature Descriptors
Feature extraction is very important for the representation of an object. In this section, we are dealing with raw video data, which may contain faces, texture coatings, background objects, etc., with a variety of artificial makeup. To deal with this variety, we need a combination of features. Shape, SFTA, and LBP descriptors are extracted. The grayscale and the proposed fused segmented frames are used in the feature extraction phase. The SFTA features are extracted in three steps. First, the fused segmented frame is used to produce a set of binary frames. Second, the fractal features are calculated using 8-neighborhood pixels.
Finally, we calculate the mean and size (in pixels) of the segmented frame. Using 8-neighborhood pixels, a feature vector (FV) is obtained. For LBP feature extraction, a binary code is calculated for each pixel in the frame by comparing the intensity of each neighboring pixel with the intensity of the current pixel. A histogram is then computed to count the number of occurrences of each binary code. The LBP features are defined as $\mathrm{LBP} = \sum_{m=0}^{n} s(g_m - g_c)\,2^m$, where n = 7, m runs over the 8 neighbors $g_m$ of the central pixel $g_c$, and $s(u) = 1$ if $u \ge 0$ and $s(u) = 0$ otherwise. The final LBP FV is computed for each frame and is later used for fusion. Finally, HOG features are extracted from the fused segmented frame as a feature vector. Later on, the proposed feature reduction technique, entropy skewness, is applied to these features.
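As an illustration of the LBP step, the sketch below computes the 8-bit code of one 3 x 3 neighborhood and the 256-bin histogram over the interior pixels of a frame. The neighbor ordering (clockwise from the top-left pixel) and the function names are assumptions; any fixed ordering yields a valid LBP variant.

```python
# Neighbor positions inside a 3 x 3 patch, clockwise from the top-left pixel.
OFFSETS = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def lbp_code(patch):
    """8-bit LBP code: bit m is set when neighbor m is >= the central pixel."""
    gc = patch[1][1]
    return sum((1 if patch[y][x] >= gc else 0) << m
               for m, (y, x) in enumerate(OFFSETS))


def lbp_histogram(img):
    """Count how often each of the 256 codes occurs over interior pixels."""
    hist = [0] * 256
    for i in range(1, len(img) - 1):
        for j in range(1, len(img[0]) - 1):
            patch = [row[j - 1:j + 2] for row in img[i - 1:i + 2]]
            hist[lbp_code(patch)] += 1
    return hist


patch = [[9, 1, 9],
         [1, 5, 1],
         [9, 1, 9]]
code = lbp_code(patch)               # corners are brighter than the center
hist = lbp_histogram(patch)          # a 3 x 3 frame has one interior pixel
```

In this patch only the four corner neighbors are at least as bright as the center, so bits 0, 2, 4, and 6 are set.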
Features Reduction using Entropy Skewness: A large number of features degrades accuracy and increases the computational time of the system [26,27]. PCA is used in the literature for dimensionality reduction. In this article, we compare our proposed feature reduction technique with PCA in terms of five performance measures. The workflow of the ES method is shown in Fig. 7. Features of the same size are used to analyze information from frames of the same dimensions, so as to obtain a high similarity index before the classification phase. For the proposed method, entropy and skewness values are calculated for all three types of extracted features. The entropy value for the features of one frame is calculated as follows:
where Ft denotes the total number of extracted features for one frame, P denotes the probability of occurrences of features and b = 10. Similarly, the mean and standard deviation are calculated for each frame feature for skewness value. The skewness value is computed by mean and SD, that are defined as: and Hence,
where , , and S denote the mean, standard deviation, and skewness of extracted features, respectively. Then we add both entropy and skewness values as:
Finally, 40, 30, and 400 features are selected from LBP, SFTA, and HOG respectively based on their mean value. The features are reduced to a value which is less than the mean value of . Then the remaining features are fused by serial-based fusion to build a codebook having dimensions of . The serial-based fusion is simple but more effective. Let , and be three extracted feature vectors having dimensions , , and , respectively. Then these features are added as:
Finally, we get the codebook of size . The constructed feature codebook is optimized by a custom-made genetic algorithm (GA) and selects the best features for action recognition.
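The entropy-skewness score and the serial fusion can be sketched as follows. Several details are assumptions rather than the authors' definitions: probabilities are taken as the empirical frequencies of distinct feature values, the logarithm base is 10 as stated in the text, skewness is the population form (mean of cubed standardized deviations), and serial fusion is plain concatenation. Under that last assumption, the 40, 30, and 400 retained features would give 40 + 30 + 400 = 470 fused features per sample. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(features, base=10):
    """Shannon entropy of the empirical distribution of feature values."""
    n = len(features)
    return -sum((c / n) * math.log(c / n, base)
                for c in Counter(features).values())


def skewness(features):
    """Population skewness: mean of ((f - mean) / std) ** 3."""
    n = len(features)
    mu = sum(features) / n
    sigma = math.sqrt(sum((f - mu) ** 2 for f in features) / n)
    if sigma == 0:
        return 0.0
    return sum(((f - mu) / sigma) ** 3 for f in features) / n


def es_score(features):
    """Sum of entropy and skewness, as the text describes the ES value."""
    return entropy(features) + skewness(features)


def serial_fusion(*vectors):
    """Serial fusion taken as concatenation of the feature vectors."""
    fused = []
    for v in vectors:
        fused.extend(v)
    return fused


codebook_row = serial_fusion([0.0] * 40, [0.0] * 30, [0.0] * 400)
```

The rule that decides which features are kept (a comparison against a mean value in the text) is not reproduced here because its exact form is not recoverable from this copy.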
Features Selection: Feature selection is performed on the fused vector in order to identify the most relevant and uncorrelated features. For this purpose we opted for a genetic algorithm (GA), which can handle large search spaces even when the objective function is stochastic. In the proposed work, the input to the GA is the extracted codebook, and the output is the set of optimized features given to the classifier. The GA comprises the standard steps of population initialization, fitness calculation, crossover, mutation, and selection. Among several existing crossover techniques, we opted for uniform crossover with a crossover rate of 0.7, applied to the selected parents x1 and x2. For mutation, a uniform approach with a rate of 0.1 is applied. For selection, we adopted the roulette wheel, in which Sp is the sorted population and Wc is the last member of the population; the selection (parent) pressure is set to 8. The fitness function is defined as the mean of the chromosome. In our case, this function guarantees the optimized solution. The newly generated feature sets are used in the classification phase, in which a multi-class SVM is employed for final classification. The labeled results are shown in Fig. 8.
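The GA operators named above can be sketched with the rates given in the text (crossover 0.7, mutation 0.1, selection pressure 8). The rest is assumed for illustration: chromosomes are binary masks over codebook columns, the crossover rate is treated as the probability that a parent pair is recombined, and roulette-wheel probabilities take the common form exp(−β·cost/worst cost). This is not the authors' custom implementation, and all names are hypothetical.

```python
import math
import random


def fitness(chromosome):
    """Mean of the chromosome, as the text describes the fitness function."""
    return sum(chromosome) / len(chromosome)


def uniform_crossover(x1, x2, rng):
    """Exchange each gene between the two parents with probability 0.5."""
    y1, y2 = list(x1), list(x2)
    for i in range(len(y1)):
        if rng.random() < 0.5:
            y1[i], y2[i] = y2[i], y1[i]
    return y1, y2


def mutate(chromosome, rate, rng):
    """Flip each binary gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def roulette_probabilities(costs, beta=8.0):
    """Selection probabilities proportional to exp(-beta * cost / worst cost)."""
    worst = max(costs)
    weights = [math.exp(-beta * c / worst) if worst > 0 else 1.0 for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def roulette_pick(probabilities, rng):
    """Draw one index according to the given probabilities."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probabilities):
        acc += p
        if r <= acc:
            return i
    return len(probabilities) - 1


rng = random.Random(0)
parents = ([1, 0, 1, 1, 0, 0], [0, 1, 1, 0, 1, 0])
if rng.random() < 0.7:               # crossover rate from the text
    children = uniform_crossover(*parents, rng)
else:
    children = parents
children = [mutate(c, 0.1, rng) for c in children]   # mutation rate from the text
```

With a selection pressure of 8, low-cost individuals dominate the wheel strongly; how the mean-of-chromosome fitness is turned into a cost for selection is not specified in the text and is left open here.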
3 Experimental Setup and Results
The computational complexity of the proposed framework is linearly dependent on the input. For each pixel, , where N2, q2, and r2 represent the mass of the input, search window, and patch, respectively. This statement relates the total steps and operations performed in this work. The sum represents the total operations required during the fusion step.
3.1 Selected Datasets
Weizmann dataset: Weizmann dataset  is considered a flexible and comprehensive action recognition dataset. This dataset has been built in an indoor environment and contains a total of 90 videos. There are 10 classes of different actions which are described in Tab. 1. Every action is performed by 9 actors in each class.
KTH dataset: The KTH action dataset includes a total of 599 videos of 6 action classes, which are described in Tab. 1. Each action class is performed by 25 actors in four distinct scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors with lighting variations.
Muhavi dataset: The Muhavi action dataset involves a total of 17 actions, each performed by 14 persons. Eight cameras at different viewpoints record the human actions. A total of 10 actions are considered in this work for classification, as listed in Tab. 1.
WVU Multi-view dataset: The WVU multi-view action dataset consists of a total of 780 action videos covering 12 human actions, each performed by 2 persons. Eight cameras at different viewpoints record the human actions. Tab. 1 lists the actions selected for classification.
3.2 Evaluation Methods
The proposed framework is validated on four large action datasets: Weizmann, KTH, Muhavi, and WVU. The selected action classes and their respective class labels are listed in Tab. 1. To assess the performance of the proposed method, 10-fold cross-validation is performed on all four datasets. The MC-SVM is used for action recognition, and we compare its performance with eight other classification algorithms: Fine-KNN, weighted-KNN, ensemble boosted tree (EBT), subspace discriminant analysis, DT, QDA, logistic regression, and Q-SVM. To measure the reliability of the proposed algorithm, we use five statistical measures: FNR, precision, sensitivity, FPR, and correct recognition rate (CRR). The proposed method is compared first with a PCA-based feature reduction model and then with existing methods. MATLAB 2019b based simulations are carried out on a personal computer.
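The five measures can be derived from a multi-class confusion matrix in a one-vs-rest manner. The sketch below uses the standard definitions, with rows as true classes and columns as predicted classes (a layout assumption); the function names and the toy matrix are hypothetical and do not come from the paper's results.

```python
def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def class_metrics(cm, k):
    """One-vs-rest measures for class k of a confusion matrix cm[true][pred]."""
    n = len(cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(cm[i][k] for i in range(n)) - tp
    tn = sum(sum(row) for row in cm) - tp - fn - fp
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "precision": _ratio(tp, tp + fp),
        "FPR": _ratio(fp, fp + tn),
        "FNR": _ratio(fn, fn + tp),
    }


def correct_recognition_rate(cm):
    """Fraction of all samples that lie on the diagonal."""
    return _ratio(sum(cm[i][i] for i in range(len(cm))),
                  sum(sum(row) for row in cm))


toy_cm = [[8, 2],
          [1, 9]]
m0 = class_metrics(toy_cm, 0)
crr = correct_recognition_rate(toy_cm)
```

For the toy matrix, class 0 has a sensitivity of 0.8 and an FNR of 0.2, and the overall correct recognition rate is 17/20 = 0.85.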
3.3 Results and Discussion
The proposed framework is a workflow of five major steps, namely preprocessing, segmentation of the ROI, feature extraction and reduction, feature selection, and recognition, where each step is a series of sub-steps as shown in Fig. 1. The proposed framework is evaluated in two stages: a) feature reduction is carried out by PCA and the result is sent to MC-SVM for recognition; b) feature reduction is performed by the novel ES method and the GA-based selected features are then provided to MC-SVM for recognition. The detailed description of each of these modules is given in Fig. 7. Four publicly available datasets, namely Weizmann, KTH, Muhavi, and WVU multi-view, are selected for evaluation. A 50:50 strategy is adopted for training and testing. The proposed algorithm is comprehensively compared with eight classifiers, and performance is evaluated by five measures: sensitivity, precision, FPR, FNR, and CRR. Additionally, we compare the proposed method with existing works, including the most recent articles, on the selected datasets to support our accuracy claims.
Fig. 9 summarizes the results of feature reduction by PCA with the multi-class SVM. The multi-class SVM achieved best recognition results of 91.7%, 98.9%, 99.8%, and 99.90% on the Weizmann, KTH, Muhavi, and WVU multi-view datasets, respectively. Moreover, the average recognition execution time of the PCA-based reduction approach for the selected datasets is 51.729 s.
The results of the proposed ES-based feature reduction and GA-based feature selection are shown in Tab. 2. The proposed method achieved best recognition results of 96.80%, 100%, 100%, and 100% on the Weizmann, KTH, Muhavi, and WVU multi-view datasets, respectively. The recognition rate of the proposed method is detailed by the confusion matrix in Tab. 3. The selected classifiers W-KNN, Q-SVM, F-KNN, and QDA also achieved the maximum recognition rate of 100% on the WVU multi-view dataset. The average recognition execution time for the proposed ES-based reduction and GA-based selection is 23.901 s, which is significantly lower than that of the PCA-based approach.
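Taking the two reported average execution times at face value (51.729 s for the PCA-based pipeline and 23.901 s for the ES and GA pipeline), the relative saving and the speed-up follow from simple arithmetic:

```latex
\[
\frac{51.729 - 23.901}{51.729} = \frac{27.828}{51.729} \approx 0.538,
\qquad
\frac{51.729}{23.901} \approx 2.16
\]
```

That is, the proposed pipeline takes roughly 54% less time on average, or is about 2.2 times faster, on the selected datasets.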
Finally, the results of the proposed method are compared with existing HAR methods for all selected datasets in Tab. 4. On the Weizmann dataset, the proposed method achieves a recognition accuracy of 96.80%, which is an improvement over existing approaches. On the KTH dataset, the recognition accuracy is 100%, which compares favorably with existing methods. Similarly, the recognition accuracy of the proposed algorithm on the WVU and Muhavi datasets is 100%, which is significantly better than [31,32]. The experimental results show that the proposed ES-based feature reduction with GA-based selection performs better than PCA-based feature reduction, and that the proposed algorithm outperforms existing techniques in terms of recognition rate. The visual results are shown in Fig. 8, where the binary results and the resulting labels can be observed.
4 Conclusion
In this article, we have introduced an Entropy-Skewness (ES) based feature reduction and classification approach applied to segmented regions of interest. The reduced features are optimized by a custom-made genetic algorithm, and the prominent features are selected and provided to the multi-class classification algorithm (MC-SVM) for the classification of multiple action classes. The ES-based feature reduction technique performs better than PCA in both recognition rate and execution time. The proposed system is evaluated on four publicly available datasets: Weizmann, KTH, Muhavi, and WVU. Recognition accuracies of 96.80%, 100%, 100%, and 100%, respectively, have been obtained. We noticed that the proposed algorithm performs significantly better for a limited number of testing samples, demonstrating the scalability and efficiency of the proposed approach. The main limitation of this work is the limited number of training and testing samples. In future, we will focus on more complex action recognition challenges such as detecting suspicious behavior and forensic analysis of moving objects. To achieve this, we will investigate deep learning features to recognize complex movements accurately and efficiently.
Funding Statement: This research was supported by Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0012724, The Competency Development Program for Industry Specialist) and the Soonchunhyang University Research Fund.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|