Human Gait Recognition Using Deep Learning and Improved Ant Colony Optimization

Human gait recognition (HGR) has received considerable attention over the last decade as an alternative biometric technique. The main challenges in gait recognition are changes in the subject's view angle and covariate factors, chiefly walking while carrying a bag and walking while wearing a coat. Deep learning is a machine learning technique that is gaining popularity, and many deep learning-based HGR techniques have been presented in the literature. Nevertheless, an efficient framework is still required for correct and quick gait recognition. In this work, we propose a fully automated deep learning and improved ant colony optimization (IACO) framework for HGR using video sequences. The proposed framework consists of four primary steps. In the first step, the video frames of the database are normalized. In the second step, two pre-trained models, ResNet101 and InceptionV3, are selected and modified according to the nature of the dataset. After that, both modified models are trained using transfer learning, and features are extracted. The IACO algorithm is then used to select the best features, which are passed to a Cubic SVM for final classification. The Cubic SVM employs a multiclass method. The experiment was carried out on three angles (0, 18, and 180) of the CASIA B dataset, and the accuracy was 95.2, 93.9, and 98.2 percent, respectively. A comparison with existing techniques is also performed, and the proposed method outperforms them in terms of accuracy and computational time.


Introduction
Human identification using biometric techniques has become an important issue in recent years [1]. Human identification techniques based on fingerprint and face detection are available. These techniques identify humans based on their distinguishing characteristics. Every person has unique fingerprints and iris patterns that are used for identification [2]. Scientists are increasingly interested in human gait as a biometric approach [3,4]. In comparison to fingerprint and face recognition technologies, gait recognition is considered more beneficial. Automatic human verification and video surveillance are two important applications of gait recognition [5,6]. HGR has recently become an active research area in biometric applications and has received significant attention in Computer Vision (CV) research [7]. Gait is a common and ordinary behavior of all humans, but from an analysis standpoint it is a very complex process. The human gait recognition process is divided into two approaches: model-based and model-free [8].
The model-based approach describes human movement based on prior knowledge [9], whereas the model-free approach generates sketches of the human body, known as posture generation or skeletons [10]. The model-based approach analyses human behavior based on joint movement and upper/lower body parts. The model-free approach, on the other hand, is simpler to implement and requires less computational time. Many computerized techniques are used in the literature to automate this application [11]. Computer vision researchers have used methods based on both classical and deep learning techniques. In traditional techniques, recognition is accomplished through a series of steps such as data preprocessing, segmentation of the region of interest (ROI), feature extraction, and classification. The authors of [12] used contrast enhancement techniques during the preprocessing step. In the following step, several segmentation techniques are used to extract the ROI. This is followed by the feature extraction step, which extracts texture, shape, and point features. These features are refined further by feature reduction techniques such as PCA, Entropy, and a few others [13]. In recent years, the introduction of deep learning into machine learning has demonstrated great success in a variety of applications, including biometrics [14], surveillance [15,16], and medicine [17,18]. A simple deep learning model does not necessitate preprocessing and can work directly on raw data. Several hidden layers are used to extract the features. Convolutional layers, ReLU layers, max-pooling, and batch normalization are examples of hidden layers. The features are flattened into one dimension at the fully connected layers and classified by a Softmax layer [19].
Mehmood et al. [20] presented a novel deep learning-based HGR framework. The presented method consisted of four significant steps: preprocessing of video frames, modification of pretrained deep learning models, selection of only the best features with a firefly algorithm, and finally, classification. In this work, fusion is also used to improve the representation of the extracted features. The experiment was carried out on three CASIA B dataset angles: 18, 36, and 54. The calculated accuracy for each angle was 94.3 percent, 93.8 percent, and 94.7 percent, respectively. Anusha et al. [21] used optimal binary patterns to implement HGR. They considered the problems of view invariance and clothing conditions. MLOOP is the name given to the extracted binary patterns. They used MLOOP to extract histogram and horizontal width features and then removed the irrelevant ones using a reduction technique. Two datasets were used in the experimental process, and the method performed well on both. Arshad et al. [22] presented a deep learning and best features selection approach for HGR with various view invariants and cofactors. Two pre-trained models were used in this method and were modified for feature extraction. In the following step, the features are fused using a parallel approach and improved further using a fuzzy entropy and skewness-based formulation. The experiment was carried out on four available datasets and yielded accuracy rates of 99.8, 99.7, 93.3, and 92.2 percent, respectively. Sugandhi et al. [23] proposed a novel HGR method based on frame aggregation. This work presents two features: the first is designed using dynamic variations of human body parts, and the second is based on first-order statistics. The frames in the first feature are divided into block cycles. Feature-level fusion is used in the final step, followed by classification. The experimental process was carried out on the CASIA B dataset and resulted in improved accuracy.
Based on these studies, we consider the following challenges in this work: (i) change in human view angle; (ii) change in wearing conditions, such as clothes; (iii) change in human characteristics across walking styles, such as slow walk and fast walk; and (iv) a deep learning model requires a large amount of data to train well, but such data cannot always be obtained due to various factors. To address these issues, we propose a new deep learning and Improved Ant Colony Optimization framework for accurate HGR. Our major contributions are as follows:
• We modified two pre-trained models, InceptionV3 and ResNet101, in terms of their fully connected layers and added a new layer connected to the preceding layers.
• We propose a feature selection technique named improved ant colony optimization (IACO). In this approach, features are initially selected using ACO and then refined using an activation function based on the mean, standard deviation, and variance.
• We applied IACO to both modified deep learning models to compare accuracy. The better of the two is considered for the final classification.

Proposed Methodology
This section describes the proposed human gait recognition method. Fig. 1 depicts the main flow diagram of the proposed approach. Dataset preprocessing, feature extraction using pretrained models, feature optimization, and classification are the main steps in this method. Deep transfer learning is used to modify two pre-trained models, ResNet101 and InceptionV3. The features are then extracted from both modified models. This yields two resultant vectors, which are then optimized using improved ant colony optimization (IACO). Finally, the selected features are classified using multiclass classification methods.

Dataset Collection and Normalization Details
CASIA B [24] is a large multiview gait dataset that was created in January 2005. It was collected from 124 subjects, each of whom was captured from 11 different view angles. This dataset includes three variations: changes in view angle, changes in clothing, and changes in carried objects. It contains three classes: walk with a bag, normal walk, and walk with a coat. We consider three angles in this work: 0, 18, and 180. Three conditions are included for each angle: bag carrying, normal walking, and wearing a coat. Fig. 2 shows a few examples of images from this dataset.

Convolutional Neural Network (CNN)
Deep learning demonstrated massive success in the classification phase of machine learning [25,26]. The convolutional neural network (CNN) is a deep learning technique. Using a convolutional operator, image pixels are convolved into features in this network. It aids us in image recognition, classification, and object detection. When compared to other classification algorithms, it requires very little preprocessing. CNN uses an image as input and then processes it through the hidden layers to classify it. The training and testing process will go through several layers, including a convolutional layer, a pooling layer, an activation layer, and a fully connected layer.

Convolutional Layer
Suppose we have a P × P square neuron layer. If we use an n × n filter ω, then the convolutional layer has an output of size (P − n + 1) × (P − n + 1). The pre-nonlinearity input to a unit $x^{l}_{ij}$ in the layer is defined as follows:
$$x^{l}_{ij} = \sum_{a=0}^{n-1} \sum_{b=0}^{n-1} \omega_{ab}\, y^{l-1}_{(i+a)(j+b)}$$
where $y^{l-1}$ denotes the output of the previous layer.
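The output-size rule for the convolutional layer can be checked with a minimal sketch. It assumes stride 1 and no padding, which is what the (P − n + 1) size implies; the function name `conv2d_valid` and the toy values are illustrative only and are not taken from the original work.

```python
def conv2d_valid(y, w):
    """Slide an n x n filter w over a P x P input y (stride 1, no padding).

    Each output unit is the sum of filter weights times the input values
    under the filter, i.e. the pre-nonlinearity input to that unit.
    """
    P, n = len(y), len(w)
    out = P - n + 1  # output side length
    return [
        [
            sum(w[a][b] * y[i + a][j + b] for a in range(n) for b in range(n))
            for j in range(out)
        ]
        for i in range(out)
    ]


# Toy 4 x 4 input and 2 x 2 filter (hypothetical values).
y = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
w = [
    [1, 0],
    [0, 1],
]
x = conv2d_valid(y, w)
print(len(x), len(x[0]))  # 3 3, since 4 - 2 + 1 = 3
```

With P = 4 and n = 2 the output is 3 × 3, matching the rule stated in the text.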

ReLU Layer
The ReLU layer is an activation layer that introduces non-linearity between layers. Through this layer, negative feature values are converted to zero. Mathematically, it is defined as follows:
$$f(x) = \max(0, x)$$
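As a small illustration of the ReLU behavior described here, the following sketch applies the activation element-wise to a list of feature values. The function name `relu` and the sample values are illustrative assumptions.

```python
def relu(values):
    """Replace every negative feature value with zero; keep the rest."""
    return [max(0.0, v) for v in values]


activated = relu([-2.0, 0.0, 3.5])
print(activated)  # [0.0, 0.0, 3.5]
```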

Batch Normalization
Batch normalization is achieved through a normalization step that fixes the means and variances of each layer's inputs. Ideally, the normalization would be conducted over the entire training set. Mathematically, it is formulated as follows:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$
where B denotes a mini-batch of size m drawn from the whole training set.
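The mini-batch statistics can be sketched as follows for a single input dimension. This is a simplified illustration: the small constant `eps` added for numerical stability is an assumption, and the learnable scale and shift parameters used in practice are omitted.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize one feature over a mini-batch B of size m.

    Returns the normalized values together with the mini-batch mean
    and (biased) variance. eps is an assumed stability constant.
    """
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return normalized, mean, var


normalized, mu, var = batch_norm([1.0, 2.0, 3.0, 4.0])
print(mu, var)  # 2.5 1.25
```

After normalization the values have approximately zero mean and unit variance within the mini-batch.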

Pooling Layer
The pooling layer is normally applied after the convolution layer to reduce the spatial size of the input. It is applied individually to each depth slice of an input volume, so the volume depth is always conserved in pooling operations. Consider an input volume of width $W_1$, height $H_1$, and depth $D_1$. The pooling layer requires two hyper-parameters: the kernel/filter size G and the stride Z. On applying the pooling layer to the input volume, the output dimensions will be:
$$W_2 = \frac{W_1 - G}{Z} + 1, \qquad H_2 = \frac{H_1 - G}{Z} + 1, \qquad D_2 = D_1$$
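The effect of the two hyper-parameters on the output volume can be sketched with a small helper. It assumes the common no-padding rule in which each spatial side becomes (side − G) / Z + 1, rounded down, while the depth is unchanged; the function name and the example sizes are illustrative.

```python
def pool_output_shape(w1, h1, d1, g, z):
    """Output (width, height, depth) of a pooling layer.

    g is the kernel/filter size and z is the stride. Depth is conserved
    because pooling acts on each depth slice independently.
    """
    w2 = (w1 - g) // z + 1
    h2 = (h1 - g) // z + 1
    return w2, h2, d1


print(pool_output_shape(112, 112, 64, 2, 2))  # (56, 56, 64)
print(pool_output_shape(7, 7, 2048, 7, 7))    # (1, 1, 2048)
```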

Average Pooling Layer
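The average pooling layer calculates the average value for each patch of a feature map. Mathematically, it is formulated as follows:
$$s = \lambda \max_{(p,q) \in R} a_{pq} + (1 - \lambda)\, \frac{1}{|R|} \sum_{(p,q) \in R} a_{pq}$$
where R denotes the pooling patch, $a_{pq}$ its activations, and λ decides whether max pooling or average pooling is used. The value of λ is selected randomly as either 0 or 1. When λ = 0, the layer behaves like average pooling, and when λ = 1, it works like max pooling.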
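The switch between the two pooling modes can be sketched for a single patch as below. The convex-combination form is an assumption consistent with the description of λ (0 gives average pooling, 1 gives max pooling); the function name and patch values are illustrative.

```python
def mixed_pool(patch, lam):
    """Pool one 2-D patch: lam = 1 gives the maximum, lam = 0 the average."""
    flat = [v for row in patch for v in row]
    average = sum(flat) / len(flat)
    return lam * max(flat) + (1 - lam) * average


patch = [
    [1, 2],
    [3, 6],
]
print(mixed_pool(patch, 1))  # 6.0 (max pooling)
print(mixed_pool(patch, 0))  # 3.0 (average pooling)
```

In training, λ would be drawn at random from {0, 1}, for example with `random.choice([0, 1])`.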

Fully Connected Layer
Neurons in the fully connected (FC) layer have full connections to all the activations in the previous layer. The activations can therefore be computed with a matrix multiplication followed by a bias offset. Finally, the output of this layer is classified using a Softmax classifier. Mathematically, this function is defined as follows:
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=0}^{K} e^{z_j}}$$
where z denotes the input vector to the Softmax function, made up of $(z_0, \ldots, z_K)$. All the values $z_i$ are used as input to the Softmax function, and each can take any positive, zero, or negative real value. The exponential function is applied to each element of the input vector.
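A minimal sketch of the Softmax step is given below. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result; the function name and the example scores are illustrative.

```python
import math


def softmax(z):
    """Map a vector of real scores to probabilities that sum to one."""
    shift = max(z)  # subtract the max to avoid overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical class scores (e.g. normal walk, bag, coat).
probs = softmax([2.0, -1.0, 0.5])
print(probs.index(max(probs)))  # 0, the class with the largest score
```

Negative, zero, and positive scores are all valid inputs, and the largest score always receives the largest probability.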

Deep Learning Features
In the literature, several models have been introduced for classification, such as ResNet, VGG, GoogleNet, InceptionV3, and a few more [27]. In this work, we utilized two pre-trained deep learning models, ResNet101 and InceptionV3. The details of each model are given as follows.

Modified ResNet101
ResNet stands for residual network, and it plays a significant role in computer vision problems. ResNet101 [28] contains 104 convolutional layers arranged in 33 blocks of layers, and 29 of these blocks directly use the output of previous blocks. Initially, this network was trained on the ImageNet dataset, which includes 1000 object classes. The original architecture is illustrated in Fig. 3. This figure shows that the input images are processed in residual blocks, and each block consists of several layers. In this work, we modify this model by removing the FC layer for the 1000 object classes and adding a new FC layer that matches our number of classes. In our selected dataset, there are three classes: normal walk, walking while carrying a bag, and walking while wearing a coat. The input size of the modified model is kept at 224 × 224 × 3, and the output is N × 3. The modified model is illustrated in Fig. 4. This figure shows that the modified model consists of a convolution layer, a max pooling layer with a stride of 2, 33 residual building blocks, an average pooling layer with a stride of 7, and a new fully connected layer. After this, we trained this modified model using transfer learning (TL) [29,30]. TL is the process of reusing a model for a new task. Mathematically, it is formulated as follows: (m, n) are the training data sizes, where n ≪ m, and $\rho^{D}_{1}$ and $\rho^{T}_{1}$ are the labels of the training data. Then the TL is represented as: Visually, this process is illustrated in Fig. 5. This figure shows that the weights of the original model are transferred to the new modified model for training. From the modified model, features of dimension N × 2048 are extracted from the feature layer.

Modified Inception V3
This network consists of 48 layers and is trained on 1000 object classes [31]. The input size of an image given to the network is 299 × 299 × 3. The input first passes through three convolutional layers with a filter size of 3 × 3, followed by a max pooling layer with a window size of 3 × 3 and a stride of 2. The original model is composed of symmetric and asymmetric building blocks, including convolutions, average pooling, max pooling, concatenation, dropouts, and fully connected layers. Mathematically, the representation of this network is defined as: where momentum is represented by β, and its value is initialized as 0.9. In this work, we utilized this model for gait recognition. The CASIA B dataset was used for training this model. The input size of the modified model is kept at 224 × 224 × 3, and the output is N × 3. The modified model is illustrated in Fig. 6. This figure shows that the modified model consists of a convolution layer, a max pooling layer, an average pooling layer, and a new fully connected layer. After this, we trained this modified model using transfer learning (TL), as discussed in Section 3.4.1. The features are extracted from the average pooling layer, yielding a feature vector of dimension N × 1920.

Features Optimization
Optimal feature selection is an important research area in pattern recognition [32,33]. Many techniques are presented in the literature for feature optimization, such as PSO, ACO, GA, and a few more. In this work, we propose a feature selection algorithm named improved ant colony optimization (IACO). The working of the original ACO [34] is given as follows:
Starting Ant Optimization: In the very first step, the number of ants is computed as follows: where $F$ represents the input feature vector, $w$ represents the width of the feature vector, and $A_N$ denotes the total number of ants used for random placement over the entire vector, where each feature in the vector represents one ant.
Decision Based on Probability: The probability of an ant traveling from pixel $(e, f)$ to pixel $(g, h)$ is represented by $p_{ij}$. The probability can be computed as follows: Here, $(e, f)$ denotes a feature location, $\rho_{ef}$ denotes the amount of pheromone, and $w_{ef}$ represents the visibility, whose value is given by the following function:
Rules of Transition: This rule is mathematically presented as follows: Here, $(i, j)$ represents the location of each feature, and the ants travel from these pixels to a location $(k, l)$. If $q < q_0$, the next pixel that the ants visit is chosen according to the probability distribution given in the second part of the rule.
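The probabilistic decision step can be sketched as follows. This is a generic ACO illustration under stated assumptions, not the exact formulation of this work: the move probability is taken to be proportional to pheromone raised to `alpha` times visibility raised to `beta`, and both exponents, the function names, and the toy values are hypothetical. The threshold test on q and q_0 is left out.

```python
import random


def transition_probabilities(pheromone, visibility, alpha=1.0, beta=1.0):
    """Probability of moving to each candidate feature location.

    pheromone[k] and visibility[k] describe candidate k; alpha and beta
    are assumed weighting exponents (generic ACO convention).
    """
    weights = [(p ** alpha) * (w ** beta) for p, w in zip(pheromone, visibility)]
    total = sum(weights)
    if total == 0:
        # No information yet: fall back to a uniform choice.
        return [1.0 / len(weights)] * len(weights)
    return [x / total for x in weights]


def choose_next(probabilities, rng):
    """Sample the index of the next location (roulette-wheel selection)."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


probs = transition_probabilities([1.0, 1.0, 2.0], [1.0, 3.0, 2.0])
print(probs)  # [0.125, 0.375, 0.5]
next_index = choose_next(probs, random.Random(0))
```

Candidates with more pheromone and higher visibility are visited more often, which is what gradually concentrates the ants on informative features.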
Pheromone Update: In this step, the ants are shifted from $(i, j)$ to the updated feature location $(k, l)$. Based on this, the pheromone path is obtained after every iteration and is mathematically defined as follows, with $\rho_{ij} = w_{ij}$. Here, $\eta$ ($0 < \eta < 1$) denotes the pheromone loss ratio. A new pheromone value is obtained after every iteration. Mathematically, this process is formulated as follows: Here, $\theta$ ($0 < \theta < 1$) denotes the pheromone decay proportion, and $\rho_0$ represents the initial pheromone value. These steps are applied to all features, and the output is an optimal feature vector. The number of iterations in this work was 100. After 100 iterations, selected vectors of dimensions N × 800 and N × 750 are obtained for the modified ResNet101 and InceptionV3 models, respectively. During the analysis step, we found some redundant features in these selected vectors, which affect the recognition accuracy. Therefore, we modify this method by adding one new equation. Mathematically, it is formulated as follows: Here, $Act$ represents the activation function, which selects or discards features based on $\bar{\sigma}$. In this step, 20%-30% of the features are further removed. Based on the analysis step, we found the selected features to be better and utilized them for the final classification (in this work, the final feature vector size is N × 1150). The classification is conducted with multiple classifiers, and the best of them is chosen based on the accuracy value.
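The refinement step can be illustrated with a hypothetical thresholding rule. The exact activation function is not reproduced in this text, so the sketch below simply assumes that a feature (column) is kept when its standard deviation across samples is at least the mean standard deviation over all features; the name `act_select`, the threshold choice, and the toy matrix are assumptions for illustration only.

```python
from statistics import mean, pstdev


def act_select(features):
    """Keep columns whose spread reaches the average spread (assumed rule).

    features is a list of N rows with equal length. Returns the reduced
    rows and the indices of the retained columns.
    """
    columns = list(zip(*features))
    spreads = [pstdev(col) for col in columns]  # per-feature std deviation
    threshold = mean(spreads)                   # assumed stand-in for sigma-bar
    keep = [j for j, s in enumerate(spreads) if s >= threshold]
    reduced = [[row[j] for j in keep] for row in features]
    return reduced, keep


# Toy N x 3 matrix: the middle column is constant and carries no information.
features = [
    [1.0, 5.0, 0.0],
    [3.0, 5.0, 10.0],
    [5.0, 5.0, 20.0],
]
reduced, keep = act_select(features)
print(keep)  # [2]
```

A rule of this kind discards low-variation columns after ACO selection; how aggressively it prunes depends entirely on the chosen threshold.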

Experimental Results and Analysis
The experimental process, including the experimental setup, dataset, evaluation measures, and results, is discussed in this section. The CASIA B dataset is utilized in this work and divided in a 70:30 ratio. This means that 70% of the dataset is used for training and the remaining 30% for testing. During the training process, we set the number of epochs to 100, the number of iterations to 300, the mini-batch size to 64, and the learning rate to 0.0001. For learning, the Stochastic Gradient Descent (SGD) optimizer is employed. For cross-validation, a ten-fold process was conducted. Multiple classifiers are used, and each classifier is validated by six measures, such as recall rate, precision, accuracy, and a few more. All the simulations of this work are conducted in MATLAB 2020a. The system used for this work is a Core i7 with 16 GB of RAM and an 8 GB graphics card.
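The 70:30 division can be sketched as a shuffled split. This is an illustration only: the experiments in this work were run in MATLAB, and the function name, the fixed seed, and the use of integer sample identifiers are assumptions.

```python
import random


def split_70_30(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and testing parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_70_30(range(100))
print(len(train), len(test))  # 70 30
```

For gait data, a split stratified by class (normal walk, bag, coat) would keep the class proportions equal in both parts.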

Results Proposed 1
Three different angles are considered for the experimental process: 0, 18, and 180. The results are computed for both modified deep models, ResNet101 and InceptionV3. For all three angles, the results of the ResNet101 model are presented in Tabs. 1-3. These tables show that the Cubic SVM performed well using the proposed method for all three selected angles. Tab. 1 presents the results for 0 degrees, where the best accuracy of 95.2% is achieved by the cubic SVM. The recall rate and precision rate of this cubic SVM are both 95.2%. The quadratic SVM also performed well and achieved an accuracy of 94.7%. The computational time of this classifier is approximately 237 (sec); however, the minimum noted time is 214 (sec) for Linear SVM. Tab. 2 presents the results for 18 degrees. The best accuracy for this angle is 89.8%, achieved by the cubic SVM. The remaining classifiers also achieve good accuracy. The recall rate and precision rate of cubic SVM are 89.7% and 89.8%, respectively. The computational time of each classifier is also noted; the best time is 167.1 (sec) for linear SVM, but its accuracy is 83.5%. The difference in accuracy between cubic SVM and linear SVM is approximately 6%. Moreover, the time difference is not large; therefore, we consider cubic SVM better. Tab. 3 presents the results for 180 degrees. The maximum noted accuracy for this angle is 98.2%, achieved by cubic SVM. The confusion matrices of cubic SVM for the three angles are also plotted in Figs. 7-9. From these figures, it is noted that each class has above 90% correct prediction accuracy. Moreover, the error rate is low.

Results Proposed 2
In the second phase, we implemented the proposed method for the modified InceptionV3 model. The results are given in Tabs. 4-6. Tab. 4 shows the accuracy for 0 degrees using modified InceptionV3 and IACO. For this approach, the best achieved accuracy is 92%, by CSVM, with a recall rate, precision rate, and AUC of 92%, 92%, and 0.97, respectively. The second-best accuracy for this angle is 91%, achieved by QSVM. Computational time is also noted, and the best time is 136.5 (sec) for linear SVM. Tab. 5 presents the results for 18 degrees. In this experiment, the best accuracy is 93.9%, by CSVM, with a recall rate, precision rate, and AUC of 93.9%, 93.9%, and 0.99, respectively. The second-best accuracy is 93%, by FKNN, with a recall rate, precision rate, and AUC of 93.1%, 93%, and 0.95, respectively. The computational time of each classifier is also noted, and the best time is 415 (sec) for cubic SVM. Tab. 6 presents the results for 180 degrees, where the best accuracy of 96.7% is achieved by CSVM, with a recall rate, precision rate, and AUC of 96.7%, 96.7%, and 1.00, respectively. The accuracy of cubic SVM for all three angles is verified using confusion matrices, illustrated in Figs. 10-12. These figures show that each class's correct prediction accuracy is above 90%.

Conclusion and Future Work
We first briefly discuss the results to analyze the proposed framework. The results show that the proposed framework performed well on the chosen dataset. The accuracy for 0 and 180 degrees is better for modified ResNet101 and IACO, while the accuracy for 18 degrees is better for modified InceptionV3 and IACO. When compared to InceptionV3, the computational cost of modified ResNet101 and IACO is lower. Furthermore, the original computational cost of modified ResNet101 and InceptionV3 is nearly three times that of the proposed framework (that is, after applying IACO). Tab. 7 also includes a fair comparison with the most recent techniques. This table demonstrates that the accuracy of the proposed method exceeds that of the existing techniques. Based on the results, we can conclude that IACO aids in improving recognition accuracy while also reducing computational time. Because the accuracy of the modified deep models alone is insufficient and falls short of recent techniques, we proposed the IACO algorithm. The choice of a deep model is the main limitation of this work because we consider both models instead of just one. In future studies, we will take this challenge into account and optimize a single deep model for HGR.