Human Gait Recognition: A Deep Learning and Best Feature Selection Framework

Background — Human Gait Recognition (HGR) is a biometric-based approach that is widely used for surveillance and has been adopted by researchers for the past several decades. Several factors affect system performance, such as walking variation due to clothes, a person carrying luggage, and variations in the view angle. Proposed — In this work, a new method is introduced to overcome different problems of HGR. A hybrid method is proposed for efficient HGR using deep learning and the selection of the best features. Four major steps are involved in this work: preprocessing of the video frames, manipulation of the pre-trained CNN model VGG-16 for the computation of features, removal of redundant features extracted from the CNN model, and classification. For the reduction of irrelevant features, a Principal Score and Kurtosis based approach named PSbK is proposed. After that, the PSbK-selected features are fused in one matrix. Finally, this fused vector is fed to the One-against-All Multi Support Vector Machine (OAMSVM) classifier for the final results. Results — The system is evaluated on the CASIA-B database using six angles (0°, 18°, 36°, 54°, 72°, and 90°) and attains accuracies of 95.80%, 96.0%, 95.90%, 96.20%, 95.60%, and 95.50%, respectively. Conclusion — The comparison with recent methods shows that the proposed method performs better.

are being utilized for the preprocessing of frames, such as watershed, thresholding, and background removal [20].
After preprocessing, feature computation is a significant step [21] and is used to compute the features from image frames. The main interest is to compute the important attributes and eliminate the rest, since irrelevant features degrade system efficiency. Therefore, the main concern is the extraction of relevant features only. After the computation of features, another concern is to reduce the dimensionality of these features. Feature reduction improves system efficiency by working on relevant features only and eliminating the redundant ones [22]. Many techniques have been implemented in the literature for feature reduction and selection, such as entropy-based selection [23], variance-based reduction [24], and a few more [25].
Plenty of research has been carried out on HGR recently. Several techniques are applied for recognition, such as (i) HGR based on the human silhouette; (ii) HGR features based on traditional or classical methods; and (iii) methods based on deep learning. The techniques based on the human silhouette are very slow and space-consuming. Sometimes an incorrect silhouette gives incorrect and irrelevant attributes that affect system reliability. Feature computation through classical techniques is based on low-level attributes, and these are tied to a specific problem. Therefore, a fully automated system is needed for feature computation that yields a high-level descriptor. For the computation of high-level descriptors, many techniques are offered in the literature. There exist several problems that affect system reliability, such as various carrying conditions, clothing variations, variations in the view angle, insufficient lighting, the speed of a person, and the shadow of the feet. These factors distort the human silhouette and thus lead to inaccurate features. To address these factors, the main contributions in the field of HGR are: a) Frame transformation based on HSV and selection of the best channel, which gives the maximum features and information. b) The computation of deep features by utilizing the VGG-16 pre-trained model with the help of transfer learning. c) Selection of high-quality attributes with the help of a hybrid approach based on Principal Score and Kurtosis (PSbK). d) Merging the selected attributes and feeding them to the One-against-All Multi SVM (OAMSVM). Section 2 presents the related work of this study. The proposed work, which includes fine-tuned deep models, selection of important features, and recognition, is discussed in Section 3. Results and comparison are discussed in Section 4. The conclusion is given in Section 5.

Related Work
Several techniques have been used recently for HGR to recognize a person from the walking pattern. Castro et al. [26] deployed a method for HGR that is based on CNN features. In this technique, high-level descriptor learning is done by using low-level features. To test their method, they used an HGR dataset called TUM-GAID and, during experimental analysis, reached an accuracy of 88.9%. Alotaibi et al. [27] instituted an HGR system that is based on CNN attributes. In this method, they tried to minimize the problem of occlusion that degrades system efficiency.
To handle the problem of small data, they carried out data augmentation, and fine-tuning on the dataset was also performed. The CASIA-B database is used to assess system performance. The 90° angle of the CASIA-B dataset is used, achieving accuracies of 98.3%, 83.87%, and 89.12% on the three variations nm, bg, and cl, respectively. Li et al. [28] deployed a new network called DEEPNET in which they tried to minimize the problem that arises due to view variations. To solve this problem, they adopted the Joint Bayesian approach. Normalization of the gait phase is done by using Normalized Auto Correlation (NAC). After normalization, the gait attributes are computed. For assessment of the system, the OULP database is used and an accuracy of 89.3% is attained. Arshad et al. [4] used a method for HGR to sort out the problem of the different variations. For the computation of gait features, two CNN models, VGG26 and AlexNet, are used. The feature vectors are computed by using entropy and Kurtosis, and after computation both vectors are fused. Fuzzy Entropy controlled Kurtosis (FEcK) is utilized for the selection of the best features. Experimental analysis has been done on AVAMVG, CASIA-A, CASIA-B, and CASIA-C, achieving recognition rates of 99.8%, 99.7%, 93.3%, and 92.2%, respectively. To overcome the dilemma of angle variation, Deng et al. [29] deployed a new HGR method in which the fusion of knowledge and deterministic learning is performed. The CASIA-B database is used for experimental analysis, and recognition rates of 88%, 87%, and 86% are attained on the three angles 18°, 36°, and 54°, respectively.
Mehmood et al. [5] addressed the problem of variation by using a hybrid approach for feature selection. The gait attributes are computed from the image frames by using DenseNet-201. Two layers, avg_pool and fc1000, are used for the computation of attributes, and the parallel-order method is used to merge these features. For the selection of attributes, an algorithm based on Kurtosis and firefly is used. CASIA-B is used to assess the system's performance, and accuracies of 94.3% on 18°, 93.8% on 36°, and 94.7% on 54° are attained, respectively. Rani et al. [30] introduced an ANN-based HGR system to identify a person from the way he walks. Image preprocessing is done using background subtraction, and a morphology-based operation is used for tracking the image silhouette. A self-similarity-based technique is used for assessment of the system. They evaluated the system on the CASIA dataset and observed better performance compared to current techniques.
Zhang et al. [31] introduced a novel method to minimize the drawbacks of variations in clothes, angles, and carried items. LSTM and CNN are used for the computation of attributes from RGB image frames, after which the attributes are fused. Assessment of the system is done on CASIA-B, FVG, and USF, achieving 81.8%, 87.8%, and 99.5%, respectively. To conquer the problem of covariations, Yu et al. [32] instituted a new approach. A CNN is utilized to compute the features from the images, and a stacked progressive autoencoder is utilized to deal with the problem of variations. For the reduction of features, PCA is utilized, and the final features are fed to the KNN algorithm. The approach is tested on SZU RGB-D and CASIA-B, and recognition rates of 63.90% with variations and 97% without variations are achieved. Marcin et al. [33] presented a new technique and analyzed how different types of shoes affect the walking style of people. A total of 2700 walking cycles obtained from 81 individuals were analyzed, and an accuracy of 99% was achieved on this dataset. Khan et al. [34] introduced an HGR approach that uses video sequences to compute the attributes. In this technique, a codebook is generated; after its generation, Fisher-vector-based encoding is performed. CASIA-A and TUM-GAID are used for assessment of the system, and recognition rates of 100% and 97.74% were attained, respectively.

Proposed Methodology
A fully automated HGR system is proposed that is based on very deep neural network features. The proposed method consists of four steps: preprocessing of image frames, feature extraction through the CNN model VGG-16, feature selection through a novel combined method, and, at last, final recognition with the help of supervised learning. The complete architecture of the system is illustrated in Fig. 2.

Frame Preprocessing
Preprocessing plays an important role in CV and image processing to improve the quality of the given data [35]. Preprocessing includes resizing of images, background removal, noise removal, and changing the color space of the image, such as RGB to grayscale. In this work, preprocessing is carried out to prepare the data for the neural network. Initially, image resizing is carried out. After that, the classes are balanced to the minimum frame count over all classes. Later, the HSV transformation is performed and the best channel is selected. Mathematically, the HSV transformation is specified as follows. The R, G, and B values are first divided by 255 to change their range from 0–255 to 0–1. Let Δ = ω_v − min(ω_r, ω_g, ω_b); then

ω_v = max(ω_r, ω_g, ω_b)
ω_s = Δ / ω_v (with ω_s = 0 when ω_v = 0)
ω_h = 60° × ((ω_g − ω_b)/Δ mod 6), if ω_v = ω_r
ω_h = 60° × ((ω_b − ω_r)/Δ + 2), if ω_v = ω_g
ω_h = 60° × ((ω_r − ω_g)/Δ + 4), if ω_v = ω_b

where ω_h, ω_s, and ω_v symbolize the three channels of the HSV conversion, and ω_r, ω_g, and ω_b denote the red, green, and blue channels of the original image frame. After that, the most informative channel, ω_h, is selected, as presented in Fig. 3. This figure shows the first channel of the HSV conversion, and this best channel is processed further.
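The hue-channel extraction step can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' MATLAB implementation; the function name `hue_channel` and the toy test frame are ours:

```python
import numpy as np

def hue_channel(frame):
    """Extract the hue channel (omega_h, in degrees) from an RGB frame.

    Input: H x W x 3 uint8 RGB frame. The piecewise formula matches the
    standard RGB-to-HSV conversion described in the text.
    """
    rgb = frame.astype(np.float64) / 255.0           # scale 0-255 -> 0-1
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)                             # omega_v
    delta = v - rgb.min(axis=-1)
    h = np.zeros_like(v)
    safe = delta > 0                                 # avoid division by zero
    # pick the branch according to which channel is the maximum
    rmax = safe & (v == r)
    gmax = safe & (v == g) & ~rmax
    bmax = safe & ~rmax & ~gmax
    h[rmax] = (60.0 * ((g - b)[rmax] / delta[rmax])) % 360.0
    h[gmax] = 60.0 * ((b - r)[gmax] / delta[gmax] + 2.0)
    h[bmax] = 60.0 * ((r - g)[bmax] / delta[bmax] + 4.0)
    return h

# a 1 x 3 test frame: pure red, green, and blue pixels
frame = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
print(hue_channel(frame))   # hue 0 for red, 120 for green, 240 for blue
```

In the pipeline described above, only this ω_h channel would be kept and passed to the feature-computation stage.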

Deep Learning Feature Computing
Feature computation is a very important part of machine learning and pattern recognition [36,37]. The main objective of this step is the extraction of important features from the objects present in the image frame. After the computation of features, the next step is the prediction of the object category [38]. Numerous types of features are available, such as those based on geometry, shape, and texture. Using such features, researchers try to achieve higher accuracy but often fail on huge datasets. Deep learning is becoming important and is being used by many researchers because it works efficiently on large as well as small databases. The Convolutional Neural Network (CNN) is a well-known layer-based network that is used to extract efficient and relevant features of an object [14]. A CNN model is based on pooling, ReLU, convolution, softmax, and fully connected (FC) layers. Low-level feature extraction is carried out by the convolution layers, and high-level information is obtained at the FC layers.
In the proposed work, a pre-trained CNN model named VGG-16 is applied for the computation of features. Features are computed via transfer learning from the layers named fc6 and fc8. The features of both layers are combined into one matrix, which is processed in the next step. The steps involved in the VGG-16 model are described as follows:

VGG-16
VGG-16 is a well-known CNN model that is widely used for efficient feature computation. The size of the input image for VGG-16 is 224 × 224 × 3, i.e., the network accepts RGB images. The architecture of VGG-16 consists of the input layer, five max-pooling layers, five segments of convolution layers comprising 13 convolutional layers in total, and 3 Fully Connected (FC) layers. All convolutional layers use filters of size 3 × 3 with stride 1. The first segment contains two convolutional layers with 64 filters each and produces an output of 224 × 224 × 64; the following pooling layer reduces this to 112 × 112 × 64. The second segment contains two convolutional layers with 128 filters and yields 112 × 112 × 128, which the next pooling layer reduces to 56 × 56 × 128. The third segment contains three convolutional layers with 256 filters, followed by a pooling layer. The fourth and fifth segments each contain three convolutional layers with 512 filters, each segment followed by a pooling layer. Finally, the fully connected layers are added; the feature map entering the FC layers is of size 7 × 7 × 512. There are three FC layers in total: the first two have 4096 channels each and the third has 1000 channels. The ReLU activation function is used in all hidden layers. The architecture of VGG-16 is illustrated in Fig. 4. In this study, deep feature extraction is carried out by a pre-trained VGG-16 CNN model. The features are computed on the two layers fc6 and fc8 for the best extraction of features. After the activation function, vectors of size N × 4096 on fc6 and N × 1000 on fc8 are obtained, where N represents the number of images.
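The spatial bookkeeping in the paragraph above can be checked with a few lines of Python: 3 × 3 convolutions with stride 1 and padding 1 preserve the width and height, while each pooling layer halves them. This is a sanity check of the stated dimensions, not part of the method itself:

```python
# (convs per segment, filters) for the five VGG-16 segments described above
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
h = w = 224
for n_convs, channels in blocks:
    # the 3x3 stride-1 convolutions keep h x w; the pooling layer halves it
    h, w = h // 2, w // 2
    print(f"{n_convs} convs, {channels} filters -> {h} x {w} x {channels}")
print((h, w, channels))   # (7, 7, 512) enters the first FC layer
```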
After the extraction of features, the features of both layers are merged linearly.

Transfer Learning-Based Feature Extraction
Feature extraction is performed by transfer learning (TL) [39] on the pre-trained VGG-16. For this purpose, the VGG-16 structure was trained on various angles of the HGR database CASIA-B. Activation is performed on the fc6 and fc8 layers of the network, and the features of both layers are merged in parallel order. The input size of the image at the input layer was 224 × 224. Feature extraction through TL is therefore performed on fc6 and fc8, where the numbers of output features are N × 4096 and N × 1000, respectively.
Let v1 denote the fc6 feature vector of dimension N × 4096 and v2 denote the fc8 feature vector of dimension N × 1000, respectively. Let v3 denote the fused feature matrix of dimension N × T1, where T1 represents the fused matrix length. The length of the fused matrix depends on the maximum-dimensional vector, i.e., v1 or v2.
The maximum-dimensional vector is first computed before the fusion of both vectors:

T1 = max(dim(v1), dim(v2))

The maximum-length vector is specified through this expression, and the shorter vector must be extended so that both vectors have equal length. This is performed by a simple subtraction operation, d = |dim(v1) − dim(v2)|, which gives the difference in the lengths of the vectors. Instead of using zero padding, the mean value of the maximum-length vector is computed and placed in the d added positions of the lower-dimensional vector. The maximum index of both vectors is then identified by defining a threshold: the entries of both vectors are compared index by index, and the concatenation is performed to obtain the fused matrix v3.
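One plausible reading of these fusion steps is sketched below in NumPy: the shorter fc8 vector is padded with the mean of the longer fc6 vector, and the two equal-length vectors are then combined index by index. An element-wise maximum is used here, since the exact threshold-based comparison rule is not fully specified in the text; the function and variable names are ours:

```python
import numpy as np

def fuse_features(v1, v2):
    """Hedged sketch of the parallel fusion described above."""
    n = v1.shape[0]
    t1 = max(v1.shape[1], v2.shape[1])         # T1: length of fused matrix
    longer, shorter = (v1, v2) if v1.shape[1] >= v2.shape[1] else (v2, v1)
    d = t1 - shorter.shape[1]                   # length difference
    pad = np.full((n, d), longer.mean())        # mean of the longer vector
    shorter = np.hstack([shorter, pad])         # mean padding, not zeros
    return np.maximum(longer, shorter)          # index-wise comparison

rng = np.random.default_rng(0)
v1 = rng.normal(size=(5, 4096))   # stand-in for fc6 features
v2 = rng.normal(size=(5, 1000))   # stand-in for fc8 features
fused = fuse_features(v1, v2)
print(fused.shape)   # (5, 4096), i.e., N x T1
```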

Feature Selection
Feature selection is carried out by applying a heuristic approach based on Principal Score and Kurtosis to select only the important features and eliminate the less important ones.

Kurtosis:
After feature extraction through the VGG-16 deep network, a heuristic kurtosis-based approach is applied to the computed feature vector FV. The main goal of using this approach is to select the top features and eliminate the rest. The kurtosis is formulated as:

Kurt = (1/N) Σ_{j=1}^{N} ((x_j − x̄) / s)^4

where x̄ is the mean value, s denotes the standard deviation, and N refers to the vector size. In this process, a feature vector of dimension N × R1 is obtained and denoted by ξ(rj); it includes the top features of each frame.
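A minimal NumPy sketch of this selection step follows, assuming the top features are the columns with the highest kurtosis across frames; the ranking rule and the `keep` parameter are our assumptions:

```python
import numpy as np

def kurtosis(x):
    """Sample kurtosis of a 1-D array: mean of the fourth standardized moment."""
    xbar, s = x.mean(), x.std()
    return np.mean(((x - xbar) / s) ** 4)

def select_by_kurtosis(features, keep):
    """Score each feature column by its kurtosis and keep the top `keep`."""
    scores = np.array([kurtosis(features[:, j])
                       for j in range(features.shape[1])])
    top = np.argsort(scores)[::-1][:keep]   # indices of highest-kurtosis columns
    return features[:, np.sort(top)]        # keep original column order

rng = np.random.default_rng(1)
fv = rng.normal(size=(50, 20))              # toy stand-in for the FV matrix
reduced = select_by_kurtosis(fv, keep=8)
print(reduced.shape)   # (50, 8)
```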

Principle Component Analysis:
Principal component analysis (PCA) is a statistical technique based on a linear transformation. PCA is very useful for pattern recognition and data analysis and is widely used in image processing and computer vision. It is used for data reduction and compression, and also for decorrelation. Numerous algorithms, including neural-network-based and multivariate ones, are utilized as PCA on various datasets. PCA can be defined as the transformation of n input vectors of length N, collected as the n-dimensional vector Y = [y1, y2, . . . , yn]^T, into a transformed vector x.
A simple formula can be generated from this concept, but it is mandatory to remember that a single input vector y has N values. The vector m_y, the mean of the input variables, can be described by the relation m_y = E{y}, and the transformation is x = B(y − m_y).
The matrix B is derived from the covariance matrix C_y: its rows are the eigenvectors v of C_y, ordered by descending eigenvalue. The matrix C_y can be evaluated as:

C_y = E{(y − m_y)(y − m_y)^T}

Since the vector y is n-dimensional, the size of the matrix is n × n, and the elements C_y(i, i) lying on the main diagonal are the variances of y:

C_y(i, i) = E{(y_i − m_i)^2}

For C_y(i, j), the covariance between the components y_i and y_j can be determined as follows:

C_y(i, j) = E{(y_i − m_i)(y_j − m_j)}
The rows of B are orthonormal, so the inverse PCA transformation is possible as follows:

y = B^T x + m_y

Due to these properties, PCA can be used in image processing and computer vision. In this process, a feature matrix of dimension N × R2 is obtained and denoted by y. Finally, by using Eq. (6), a final matrix of dimension N × R3 is obtained, which is fed to the OAMSVM [40] for final recognition.
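The PCA procedure above (mean removal, covariance matrix C_y, eigenvectors in descending eigenvalue order as the rows of B, projection x = B(y − m_y)) can be sketched in NumPy; the dimensions and variable names are illustrative:

```python
import numpy as np

def pca_reduce(features, k):
    """Project features onto the top-k principal components."""
    m_y = features.mean(axis=0)                 # mean vector m_y
    centered = features - m_y                   # y - m_y
    c_y = np.cov(centered, rowvar=False)        # covariance matrix C_y
    eigvals, eigvecs = np.linalg.eigh(c_y)      # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]           # descending eigenvalue order
    b = eigvecs[:, order[:k]].T                 # rows of B = top eigenvectors
    return centered @ b.T                       # x = B (y - m_y)

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 10))               # toy feature matrix
reduced = pca_reduce(data, k=3)
print(reduced.shape)   # (100, 3)
```

The projected components are mutually decorrelated, which is the property the text highlights.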

Results and Analysis
In this section, the validation of the proposed system is presented. The system is tested on the publicly available CASIA-B [41] database, which is used for multi-view HGR.

Implementation Details
The standard 70:30 ratio is used for the evaluation of the system: 70% of the image frames are utilized for training and 30% for testing. 10-fold cross-validation is used, and the learning rate was 0.0001. Transfer learning is used for the computation of training and testing features. After the computation of features, these features, along with their labels, are fed into the M-SVM (linear method) for final classification. The trained model is then used to predict the test features. All experiments are performed using MATLAB 2018b, running on a Core i7 machine with 16 GB of RAM and a 4 GB NVIDIA GeForce 940MX GPU. Moreover, MatConvNet, a toolbox for deep learning, is used to compute the deep features.
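The one-against-all scheme can be illustrated with a small NumPy sketch: one linear hinge-loss classifier per class, with the highest-scoring class winning at prediction time. This is a toy stand-in for the paper's MATLAB M-SVM, using made-up 2-D data and the same 70:30 split:

```python
import numpy as np

def train_ova(features, labels, epochs=1000, lr=0.01, reg=1e-3):
    """Train one linear hinge-loss classifier per class (one-against-all)."""
    classes = np.unique(labels)
    n, d = features.shape
    x = np.hstack([features, np.ones((n, 1))])        # append a bias column
    weights = np.zeros((len(classes), d + 1))
    for i, c in enumerate(classes):
        y = np.where(labels == c, 1.0, -1.0)          # this class vs the rest
        w = np.zeros(d + 1)
        for _ in range(epochs):
            mask = y * (x @ w) < 1                    # margin-violating samples
            grad = reg * w - (y[mask][:, None] * x[mask]).sum(axis=0) / n
            w -= lr * grad                            # subgradient step
        weights[i] = w
    return classes, weights

def predict_ova(classes, weights, features):
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return classes[np.argmax(x @ weights.T, axis=1)]  # highest score wins

# three well-separated 2-D classes and a 70:30 train/test split
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)
idx = rng.permutation(90)
train, test = idx[:63], idx[63:]
classes, w = train_ova(X[train], y[train])
acc = (predict_ova(classes, w, X[test]) == y[test]).mean()
print(f"test accuracy: {acc:.2f}")
```

The subgradient training loop here only illustrates the one-vs-all decision rule; it is not the exact SVM solver used in the paper.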

CASIA B Dataset
CASIA-B is a publicly available database used for human gait recognition. The database was captured in an indoor environment. There are several variations in the database, such as various angles and carrying and clothing conditions. The dataset comprises 124 subjects recorded from 11 different angles: 0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°, and 180°.

Exp 1: 0 Angle
The results on the 0° angle are given in this section and are illustrated in Tab. 1.

Exp 3: 36 Angle
The results on the 36° angle are given in this section and are illustrated in Tab. 3.

Exp 4: 54 Angle
The results on the 54° angle are given in this section and are illustrated in Tab. 4.

Exp 5: 72 Angle
The results on the 72° angle are given in this section and are illustrated in Tab. 5.

Discussion and Comparison
A detailed discussion is provided in this section. As demonstrated in Fig. 2, the introduced system is based on a few steps: computation of deep features by utilizing the pre-trained VGG-16 model, feature fusion, and feature selection with the help of PCA and Kurtosis. After that, the powerful features are selected and combined. Finally, the feature vector is fed to the one-against-all Multi-SVM. The proposed system is assessed by using six different angles of the CASIA-B dataset: 0°, 18°, 36°, 54°, 72°, and 90°. The results for the angles are calculated separately and demonstrated in Tabs. 1-6, respectively. The computational time for each angle is also computed. An extensive comparison has been carried out with recent HGR methodologies to assess the proposed methodology, as shown in Tab. 7. Mehmood et al. [5] introduced a hybrid feature selection HGR method based on a deep CNN. They used the CASIA-B database for the assessment of their technique and attained recognition rates of 94.3%, 93.8%, and 94.7% on the 18°, 36°, and 54° angles, respectively. Ben et al. [42] introduced an HGR technique called CBDP to address the problem of view variations. The CASIA-B dataset is utilized to assess the system, and accuracies of 81.77% on 0°, 78.06% on 18°, 78.6% on 36°, 80.16% on 54°, 79.06% on 72°, and 77.96% on 90° are achieved. Anusha et al. [43] advised a novel HGR method based on binary descriptors in which feature dimensionality reduction is also used. The CASIA-B dataset is utilized to assess system performance, and accuracies of 95.20%, 94.60%, 95.40%, 90.40%, and 93.00% are attained on the 0°, 18°, 36°, 54°, and 72° angles, respectively. Arshad et al. [8] introduced an HGR methodology based on the binomial distribution and achieved a recognition rate of 87.70% on CASIA-B using the 90° angle. Zhang et al. [31] suggested an encoder-based architecture for HGR to address the problem of variations by utilizing LSTM- and CNN-based networks. The system was assessed using the CASIA-B database and attained a recognition rate of 91.00% on 54°. In the case of our proposed HGR method, recognition rates of 95.80%, 96.0%, 95.90%, 96.20%, 95.60%, and 95.50% are obtained on the 0°, 18°, 36°, 54°, 72°, and 90° angles, respectively. The strength of this work is the selection of the best features; its limitation is the small number of predictors available for the final classification.

Conclusion
HGR is a biometric-based approach in which an individual is recognized from the walking pattern. In this work, a new method is introduced for HGR to address various factors such as view variations, clothes variations, and different carrying conditions, e.g., a person wearing a coat, walking normally, or carrying a bag. The feature computation has been carried out by using the pre-trained network VGG-16 instead of classical feature methods such as color-based, shape-based, and geometric features. A PCA- and Kurtosis-based method is used for the reduction of the features. Six different angles of CASIA-B are utilized to assess the performance of the system, attaining an average recognition rate of more than 90%, which is better than recent techniques. From the results demonstrated in this work, it can easily be verified that CNN-based features give better performance in the sense of better attributes and accuracy, and that deep features work well for small as well as large databases. Overall, the introduced approach works well for the various angles of CASIA-B and gives a good performance. However, the introduced method is inefficient in the case of small data. In the future, the same approach can be applied to different angles of CASIA-B and to other HGR databases.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.