Smart Deep Learning Based Human Behaviour Classification for Video Surveillance

Abstract: Real-time video surveillance systems are commonly employed to aid security professionals in preventing crimes. Deep learning (DL) technologies have transformed real-time video surveillance into smart video surveillance systems that automate human behavior classification. The recognition of events in surveillance videos is an active research topic in computer science. Human action recognition (HAR) is a crucial task in several application areas, including smart video surveillance, where it improves the level of security. Advances in DL models help to achieve improved recognition performance. In this view, this paper presents a smart deep learning-based human behavior classification (SDL-HBC) model for real-time video surveillance. The proposed SDL-HBC model first employs adaptive median filtering (AMF) based pre-processing to reduce the noise content. Next, the capsule network (CapsNet) model is used to extract feature vectors, and the hyperparameters of the CapsNet model are tuned using the Adam optimizer. Finally, the differential evolution (DE) with stacked autoencoder (SAE) model is applied to classify human activities in the intelligent video surveillance system. The performance of the SDL-HBC technique is validated on two benchmark datasets, namely the KTH and UCF Sports datasets. The experimental outcomes show that the SDL-HBC technique achieves better recognition performance than recent state-of-the-art approaches, with a maximum accuracy of 0.9922.


Introduction
Human action recognition (HAR) and classification techniques have many applications in day-to-day life. Video surveillance is employed in smart supervision systems in banks, smart buildings, and parking lots [1]. Communication between machines and humans is a major challenge, which is addressed by methods such as hand gesture classification and speech recognition [2]. Processing the video frames acquired from a security camera to recognize and flag abnormal behavior yields an automated care monitoring scheme that acts as a human action detector [3]. Furthermore, many elderly and sick people live alone and need constant supervision, which creates the need for an intelligent system to monitor them. Several factors affect the efficacy of action detection systems, such as the background of the location, any abnormal conditions, and the detection time. Together with the objects under study and the kind of behavior and actions involved, these factors determine how the behavior is classified and recognized [4]. In partial behavior analysis, for example, only the upper part of the body is used for recognizing hand gestures. Analyzing human behavior from a captured video needs a preprocessing phase that involves foreground and background recognition as well as tracking individuals in successive frames.
Other important steps include feature extraction, the selection of an appropriate model or classifier, and lastly the authentication, classification, and detection procedures that build on the extracted features. The initial phase of object behavior detection is recognizing the movement of the object in an image and classifying it. The most commonly used method for detecting moving objects is background subtraction [5]. In its simplest form, background subtraction compares every frame of the video with a static background. After the preprocessing phase, automated recognition systems include two major phases: feature extraction and action classification [6]. The most significant steps in behavior analysis are extracting features and creating an appropriate feature vector, which provides the primitive information for the classification.
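The simplest form of background subtraction mentioned above can be illustrated with a short sketch. It is not taken from the cited works: the function name `foreground_mask`, the threshold of 25 grey levels, and the toy 4 × 4 scene are assumptions made only for illustration.

```python
import numpy as np

def foreground_mask(frame, background, threshold=25):
    """Return a boolean mask of pixels that differ from a static background."""
    # Use a signed type so that uint8 subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold  # threshold is an assumed, tunable value

# Toy grayscale scene: uniform background plus a bright 2x2 "moving object".
background = np.full((4, 4), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200
mask = foreground_mask(frame, background)
print(mask.astype(int))
```

In practice the static background would be estimated from the video itself, and the threshold would be chosen for the camera and lighting at hand.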
Accurate action recognition is difficult in the presence of cluttered backgrounds and viewpoint variations. Hence, one of the most popular methods for HAR employs engineered motion [7] and texture descriptors evaluated around spatio-temporal interest points. Additionally, many approaches follow the traditional method of pattern recognition [8], which depends on two major phases: calculating complex handcrafted features from the video frames and learning a classifier based on the obtained features. In real-time scenarios, it is rarely known which features are significant to the task at hand, because the selection of features is highly problem-dependent [9]. This paper presents a smart deep learning-based human behavior classification (SDL-HBC) model for real-time video surveillance. The proposed SDL-HBC model first employs adaptive median filtering (AMF) based pre-processing to reduce the noise content. In addition, the capsule network (CapsNet) model is utilized to extract feature vectors, and the hyperparameters of the CapsNet technique are tuned using the Adam optimizer. Finally, the differential evolution (DE) with stacked autoencoder (SAE) model is applied to classify human activities in the intelligent video surveillance system. The simulation result analysis of the SDL-HBC technique is carried out on two benchmark datasets, namely the KTH and UCF Sports datasets.

Literature Review
Nikouei et al. [10] introduced a lightweight convolutional neural network (L-CNN) that combines the Single Shot Multi-Box Detector (SSD) with depth-wise separable convolution. By narrowing down the classifier's search space to emphasize human objects in surveillance video frames, the L-CNN method can detect pedestrians with a computational workload that is reasonable for an edge device. Nawaratne et al. [11] presented the incremental spatiotemporal learner (ISTL) to address the limitations and challenges of anomaly detection and localization in real-time video surveillance. ISTL is an unsupervised DL method that employs active learning with fuzzy aggregation to continually distinguish between, and update its notion of, the new normality and the anomalies that evolve over time.
Bouachir et al. [12] designed a vision-based methodology for automatically detecting suicide attempts by hanging. This smart video surveillance system operates on the depth stream provided by an RGB-D camera, regardless of the illumination conditions. The approach exploits body joint positions to model suicidal behavior: static and dynamic pose features are computed to capture the body joint movements. Wan et al. [13] developed a smartphone accelerometer-based framework for HAR. The data are pre-processed by denoising, normalization, and segmentation to extract valuable feature vectors. Furthermore, a real-time human activity classification method was presented that employs a CNN for local feature extraction.
Han et al. [14] presented a data set remodeling approach that transfers the parameters of ResNet-101 layers trained on the ImageNet data set to initialize the learning model, and adopted an augmented data variation method to overcome the over-fitting caused by sample deficiency. As a model structure improvement, a new deep two-stream ConvNet was developed for learning action complexity. Ullah et al. [15] proposed an improved and effective CNN-based method for processing, in real time, data streams obtained from the visual sensors of non-stationary surveillance environments. First, frame-level deep features are extracted by a pre-trained CNN. Then, an enhanced deep autoencoder (DAE) is used to learn the temporal variations of the actions in the surveillance stream.

The Proposed Model
In this study, a novel SDL-HBC technique is developed for recognizing human behavior in intelligent video surveillance systems. The proposed SDL-HBC technique aims to correctly determine the occurrence of several activities in the surveillance videos. It encompasses several stages of operation: AMF-based pre-processing, CapsNet-based feature extraction, Adam optimizer-based hyperparameter tuning, SAE-based classification, and DE-based parameter tuning.

AMF Based Pre-Processing
First, the AMF technique is used to pre-process the input image and remove the noise that exists in it [16]. The AMF technique uses the median value of a window to replace the center pixel covered by that window. If the center pixel is pepper (minimum intensity) or salt (maximum intensity) noise, it is substituted with the median value of the window [17]. The filter operates as follows: the pixel values in the window are sorted in ascending order; the median is the middle value of the sorted sequence; and the noisy pixel is replaced by this median value.
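The filtering step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the growing-window rule (start at 3 × 3 and enlarge while the median itself looks like an impulse) is the common textbook variant of AMF and is assumed here, as are the function name `adaptive_median_filter`, the `max_window` parameter, and the reflective border padding.

```python
import numpy as np

def adaptive_median_filter(image, max_window=7):
    """Replace salt-and-pepper pixels of a 2D grayscale image by a local median."""
    img = np.asarray(image)
    pad = max_window // 2
    padded = np.pad(img, pad, mode="reflect")  # border handling is an assumption
    out = img.copy()
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            for size in range(3, max_window + 1, 2):
                half = size // 2
                window = padded[r + pad - half: r + pad + half + 1,
                                c + pad - half: c + pad + half + 1]
                z_min, z_med, z_max = window.min(), np.median(window), window.max()
                if z_min < z_med < z_max:
                    # The median is not an impulse; replace the pixel only
                    # if the pixel itself sits at a window extreme.
                    if not (z_min < img[r, c] < z_max):
                        out[r, c] = z_med
                    break
            else:
                # No window size gave a usable median: fall back to the
                # median of the largest window.
                out[r, c] = z_med
    return out

# Toy image: constant grey level with one salt and one pepper pixel.
noisy = np.full((5, 5), 100, dtype=np.uint8)
noisy[2, 2] = 255  # salt
noisy[0, 0] = 0    # pepper
cleaned = adaptive_median_filter(noisy)
print(cleaned)
```

Pixels that are not at a window extreme are left unchanged, which is what distinguishes this filter from a plain median filter that smooths every pixel.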

Feature Extraction Using Optimal CapsNet Model
At this stage, the preprocessed image is passed into the CapsNet model to derive a useful set of feature vectors. The CNN model is an effective method for 2D object recognition. However, because of the data routing process in the CNN model, details such as the position and pose of the objects are not considered. To resolve this issue, a network model named CapsNet was derived. It is a deep network that comprises a set of capsules, each of which consists of a collection of neurons. The activations of the neurons indicate the features of the elements that exist in the object. Every capsule is responsible for determining an individual element of the object, and the capsules are integrated to compute the complete structure of the object. The CapsNet comprises a multiple-layer network [18]. Fig. 1 showcases the framework of the CapsNet model. The length of the output vector $u_i$ denotes the probability that the respective element occurs, and the direction of $u_i$ encodes different characteristics of that element. The prediction vector $\hat{u}_{j|i}$ signifies the belief that encodes the relationship between the $i$-th capsule in the low-level layer and the $j$-th capsule in the high-level layer by means of a linear transformation matrix $W_{ij}$, as given below.

$$\hat{u}_{j|i} = W_{ij} u_i \tag{1}$$

The occurrence and pose of the identified components can be used to predict the occurrence and pose of the whole object. During training, the network progressively learns to adjust the transformation matrices of the capsules according to the relationship between the elements and the whole object. At the high-level capsule, $s_j$ and $v_j$ denote the input and output of capsule $j$, respectively; $s_j$ is the sum of the prediction vectors $\hat{u}_{j|i}$ from the low-level capsules $i$, weighted by the corresponding $c_{ij}$.

$$s_j = \sum_i c_{ij} \hat{u}_{j|i} \tag{2}$$

In Eq. (2), $c_{ij}$ indicates the coupling coefficient, which is computed using an iterative dynamic routing approach, where $\sum_j c_{ij} = 1$ and $c_{ij} \geq 0$. If $c_{ij} = 0$, no data is transmitted between capsules $i$ and $j$. When $c_{ij} = 1$, all details of capsule $i$ are sent to the high-level capsule $j$. As the output length represents a probability, a non-linear squash function is used to ensure that short vectors are shrunk to a length near 0 and long vectors are compressed to a length near 1. The routing step is summarized by Eqs. (2)-(4), where $b_{ij}$ denotes the routing logit between capsules $i$ and $j$:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|} \tag{3}$$

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{4}$$

If the low-level and high-level capsules are consistent in their predictions, the value of $c_{ij}$ is high, and it is reduced if they are inconsistent [19]. By modifying the routing coefficients, the dynamic routing model ensures that a low-level capsule transmits its prediction vector to the high-level capsule that agrees with the prediction, so the output of the sub-capsule is transmitted to the correct parent capsule.
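The routing procedure described above (weighted sum of prediction vectors, squash non-linearity, and coupling coefficients that sum to one over the high-level capsules) can be sketched in a few lines. The sketch assumes the standard routing-by-agreement update, in which the routing logits grow with the dot product between a prediction vector and the capsule output; the function names, the three routing iterations, and the random prediction vectors are illustrative assumptions, not details given in this paper.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink short vectors toward length 0 and long vectors toward length 1."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Route prediction vectors u_hat of shape (num_low, num_high, dim)."""
    num_low, num_high, _ = u_hat.shape
    b = np.zeros((num_low, num_high))  # routing logits (assumed zero init)
    for _ in range(num_iters):
        # Softmax over high-level capsules j, so each row of c sums to 1.
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        s = np.sum(c[:, :, None] * u_hat, axis=0)  # weighted sum, (num_high, dim)
        v = squash(s)                              # capsule outputs
        # Agreement between predictions and outputs raises the logits.
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))  # 6 low-level, 3 high-level capsules, 4-D poses
v, c = dynamic_routing(u_hat)
print(np.linalg.norm(v, axis=1))    # output lengths, each below 1
```

The returned coefficients `c` are those used to compute the final outputs `v`; every row is non-negative and sums to one, matching the constraints stated in the text.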
The Adam optimizer is used to select the hyperparameter values of the CapsNet model. Adam is one of the most widely employed techniques that adapt the learning rate for every parameter. It combines distinct gradient optimization approaches: like RMSprop and Adadelta, it keeps an exponentially decaying average of past squared gradients, and, analogous to Momentum, it also keeps an exponentially decaying average of the past gradients themselves.

$$M_t = \beta_1 M_{t-1} + (1 - \beta_1) g_t \tag{5}$$

$$G_t = \beta_2 G_{t-1} + (1 - \beta_2) g_t^2 \tag{6}$$

where $g_t$ is the gradient at step $t$, and $\beta_1$ and $\beta_2$ represent the decay rates, which are set to their default values. $M_t$ and $G_t$ estimate the mean of the past gradients (first moment) and their uncentered variance (second moment), respectively. Since the decay rates bias these estimates, it is essential to perform bias correction [20].

$$\hat{M}_t = \frac{M_t}{1 - \beta_1^t}, \qquad \hat{G}_t = \frac{G_t}{1 - \beta_2^t} \tag{7}$$

Hence, the update of Adam is given by Eq. (8), where $\eta$ is the learning rate and $\epsilon$ is a small constant for numerical stability:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{G}_t} + \epsilon} \hat{M}_t \tag{8}$$

In the gradient part of the update of $\theta_t$, each operation depends only on the past gradients of the current parameter and not on the learning rate. Therefore, Adam performs well as an adaptive learning rate method.
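A single Adam step, with the two moment estimates, the bias correction, and the parameter update described above, can be sketched as follows. The default values (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly used ones and are assumed here because the paper only says that defaults are followed; the quadratic test function and the learning rate of 0.05 are purely illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameters and moments."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    g = beta2 * g + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    g_hat = g / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(g_hat) + eps)
    return theta, m, g

# Minimize f(theta) = sum(theta^2), whose gradient is 2 * theta.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
g = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, g = adam_step(theta, grad, m, g, t, lr=0.05)
print(theta)
```

On the very first step the bias-corrected ratio equals the sign of the gradient, so each parameter moves by roughly the learning rate regardless of the gradient's scale.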

Human Behavior Detection and Classification
During the detection and classification process, the SAE model receives the feature vectors as input and allots proper class labels to them. In this work, the SAE is built from autoencoder (AE) and Logistic Regression (LR) layers [21]. The AE is the building block of the SAE classification method. It is composed of an encoder stage (Layer 1 to 2) and a reconstruction or decoder stage (Layer 2 to 3).

$$y = s(Wx + b) \tag{11}$$

$$z = s(W^T y + b') \tag{12}$$

Here, $W$ and $W^T$ (the transpose of $W$) represent the weight matrices, $b$ and $b'$ are two different bias vectors, and $s$ is a nonlinearity such as the sigmoid function; $y$ denotes the latent representation of the input layer $x$, and $z$ is regarded as a prediction of $x$ given $y$ and has the same shape as $x$. Fig. 2 illustrates the architecture of the SAE technique.

Figure 2: Structure of SAE
Several AE layers are stacked together in the unsupervised pretraining phase (Layer 1 to 4). The latent representation $y$ computed by one AE is used as the input of the next AE layer. Each such layer is trained as an AE by minimizing its reconstruction error [22]. The reconstruction error (loss function $L(x, z)$) is estimated in every iteration. Here, cross-entropy is used to measure the reconstruction error, in which $x_k$ and $z_k$ represent the $k$-th components of $x$ and $z$, respectively.

$$L(x, z) = -\sum_k \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right] \tag{13}$$

The reconstruction error is minimized by gradient descent (GD). The weights and biases in Eqs. (11) and (12) are updated as per Eqs. (14)-(16), in which $l$ represents the learning rate.

$$W \leftarrow W - l \frac{\partial L(x, z)}{\partial W} \tag{14}$$

$$b \leftarrow b - l \frac{\partial L(x, z)}{\partial b} \tag{15}$$

$$b' \leftarrow b' - l \frac{\partial L(x, z)}{\partial b'} \tag{16}$$

Once the layers are pre-trained, the network undergoes a supervised fine-tuning stage.
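The pretraining of a single AE layer, with a sigmoid encoder, a decoder that reuses the transposed weight matrix, a cross-entropy reconstruction loss, and gradient descent updates, can be sketched as below. This is a minimal sketch under stated assumptions: the layer sizes (8 inputs, 4 hidden units), the learning rate of 0.1, the single binary training vector, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ae_forward(x, W, b, b_prime):
    y = sigmoid(W @ x + b)            # encoder: latent representation
    z = sigmoid(W.T @ y + b_prime)    # decoder with tied (transposed) weights
    return y, z

def cross_entropy(x, z, eps=1e-12):
    """Cross-entropy reconstruction error between input x and prediction z."""
    return -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))

def ae_gradient_step(x, W, b, b_prime, lr=0.1):
    """One gradient descent update of W, b and b' on the reconstruction loss."""
    y, z = ae_forward(x, W, b, b_prime)
    delta_out = z - x                          # gradient at decoder pre-activation
    delta_hid = (W @ delta_out) * y * (1 - y)  # back-propagated to the encoder
    # W appears in both the encoder and the decoder, so both terms contribute.
    grad_W = np.outer(delta_hid, x) + np.outer(y, delta_out)
    return W - lr * grad_W, b - lr * delta_hid, b_prime - lr * delta_out

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
W = rng.normal(scale=0.1, size=(4, 8))
b, b_prime = np.zeros(4), np.zeros(8)
loss_before = cross_entropy(x, ae_forward(x, W, b, b_prime)[1])
for _ in range(200):
    W, b, b_prime = ae_gradient_step(x, W, b, b_prime)
loss_after = cross_entropy(x, ae_forward(x, W, b, b_prime)[1])
print(loss_before, loss_after)
```

In a stacked setting, the latent vector `y` of a trained layer would become the input `x` of the next layer, and the final representation would feed the LR layer during supervised fine-tuning.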

Parameter Tuning Using DE Algorithm
To tune the weight and bias values of the SAE model, the DE algorithm is utilized, which improves the recognition performance. DE is a population-based search approach that was initially developed by Price and Storn [23]. In the current work, a three-step adjustment method based on DE is used to solve the optimization problem; here, the target is to optimize the weight and bias parameters of the SAE model. To perform this task, a number of solution vectors are initialized randomly and iteratively updated by genetic operators (mutation and crossover) and a selection operator. First, the mutation operator picks three randomly selected solution vectors ($r_1$, $r_2$, and $r_3$) from the DE population. The difference between two of these vectors ($r_2$ and $r_3$), multiplied by a scaling factor $F$, is added to the first vector ($r_1$). Thus, every target solution $x_i^G$ is associated with a mutant solution vector $v_i^{G+1}$.

$$v_i^{G+1} = x_{r_1}^G + F \left( x_{r_2}^G - x_{r_3}^G \right) \tag{17}$$

Next, the crossover operator computes a trial vector $u_i^{G+1}$ by combining the target solution vector with the mutant vector as follows.

$$u_{j,i}^{G+1} = \begin{cases} v_{j,i}^{G+1}, & \text{if } rand(j) \leq CR \text{ or } j = randn(i) \\ x_{j,i}^{G}, & \text{otherwise} \end{cases} \tag{18}$$

Here, $j = 1, 2, \ldots, D$; $rand(j) \in [0, 1]$ denotes a random value generated for the $j$-th parameter; $CR$ indicates the crossover probability, which ranges from zero to one; and $randn(i) \in \{1, 2, \ldots, D\}$ is a randomly chosen index that ensures $u_i^{G+1}$ gets at least one component from $v_i^{G+1}$, since otherwise no new vector would be produced and the population would remain the same. Lastly, in the selection step, if and only if the trial vector $u_i^{G+1}$ yields a better fitness value than $x_i^G$, then $x_i^{G+1}$ is set to $u_i^{G+1}$; otherwise, the old vector $x_i^G$ is retained.

$$x_i^{G+1} = \begin{cases} u_i^{G+1}, & \text{if } f(u_i^{G+1}) < f(x_i^G) \\ x_i^G, & \text{otherwise} \end{cases} \tag{19}$$
The DE technique derives a fitness function to attain improved classification performance. It determines a positive value that represents the quality of the candidate solutions. In this study, the minimization of the classification error rate is considered as the fitness function, as given in Eq. (20). The optimal solution has a minimal error rate and the worst solution has the highest error rate [24].

$$fitness(x_i) = \frac{\text{number of misclassified samples}}{\text{Total number of samples}} \times 100 \tag{20}$$
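The mutation, crossover, and selection steps described above, together with the error-rate fitness of Eq. (20), can be sketched as follows. The sketch is illustrative only: the population size, `F = 0.5`, `CR = 0.9`, the number of generations, the clipping of mutants to the search bounds, and the sphere function used in the demonstration are assumptions, and tuning real SAE weights would require a fitness function that trains or evaluates the classifier.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Percentage of misclassified samples, as in Eq. (20)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(y_true != y_pred)

def differential_evolution(fitness, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=100, seed=0):
    """Minimize `fitness` over box bounds with a basic rand/1/bin DE scheme."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T
    dim = low.size
    pop = rng.uniform(low, high, size=(pop_size, dim))
    scores = np.array([fitness(x) for x in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: three distinct vectors, all different from the target.
            others = [k for k in range(pop_size) if k != i]
            r1, r2, r3 = rng.choice(others, size=3, replace=False)
            mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), low, high)
            # Crossover: take at least one component from the mutant.
            cross = rng.random(dim) <= CR
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # Selection: keep the trial only if it is strictly better.
            trial_score = fitness(trial)
            if trial_score < scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = np.argmin(scores)
    return pop[best], scores[best]

# Stand-in objective (sphere function) in place of a real classifier error.
best_x, best_score = differential_evolution(lambda x: float(np.sum(x ** 2)),
                                            bounds=[(-5, 5)] * 3)
print(best_x, best_score)
print(error_rate([0, 1, 1, 0], [0, 1, 0, 0]))
```

To use the sketch for parameter tuning, each candidate vector would be decoded into SAE weights and biases, and `error_rate` on held-out predictions would serve as the fitness value.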

Performance Validation
The performance of the proposed model is validated using two benchmark datasets, namely the KTH and UCF Sports datasets. The former, the KTH dataset (available at https://www.csc.kth.se/cvap/actions/), is an open-access dataset comprising six kinds of video actions at a resolution of 160 × 120. Every video is converted into a set of 100 frames.
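The per-class measures used in this section (sensitivity, specificity, precision, accuracy, and F-score) follow from one-vs-rest confusion counts. The sketch below shows the standard definitions. The counts passed to it are hypothetical: the paper does not report confusion counts, and the values assume 100 clips of the class among 600 test clips, chosen only because they are consistent with the figures quoted for the 'Boxing' class.

```python
def one_vs_rest_metrics(tp, fp, fn, tn):
    """Compute per-class measures from one-vs-rest confusion counts."""
    sensitivity = tp / (tp + fn)                 # recall on the class
    specificity = tn / (tn + fp)                 # recall on the other classes
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
    }

# Hypothetical counts (not from the paper), see the note above.
metrics = one_vs_rest_metrics(tp=99, fp=3, fn=1, tn=497)
print({name: round(value, 4) for name, value in metrics.items()})
```

With these assumed counts the five measures round to 0.9900, 0.9940, 0.9706, 0.9933, and 0.9802.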
This section investigates the results of the SDL-HBC model on the KTH test dataset. The results demonstrate that the SDL-HBC model attains effective recognition performance. For instance, for the 'Boxing' class, the SDL-HBC model achieves $sens_y$, $spec_y$, $prec_n$, $accu_y$, and $F_{score}$ of 0.9900, 0.9940, 0.9706, 0.9933, and 0.9802, respectively. For the 'Handwaving' class, it achieves $sens_y$, $spec_y$, $prec_n$, $accu_y$, and $F_{score}$ of 0.9700, 0.9940, 0.9700, 0.9900, and 0.9700. For the 'Walking' class, it achieves $sens_y$, $spec_y$, $prec_n$, $accu_y$, and $F_{score}$ of 0.9800, 0.9980, 0.9899, 0.9950, and 0.9849. On average, the SDL-HBC model attains $sens_y$, $spec_y$, $prec_n$, $accu_y$, and $F_{score}$ of 0.9767, 0.9953, 0.9768, 0.9922, and 0.9767, respectively.

Conclusion
In this study, a novel SDL-HBC technique has been developed for recognizing human behavior in intelligent video surveillance systems. The proposed SDL-HBC technique aims to correctly determine the occurrence of several activities in the surveillance videos. It encompasses several stages of operation: AMF-based pre-processing, CapsNet-based feature extraction, Adam optimizer-based hyperparameter tuning, SAE-based classification, and DE-based parameter tuning. The use of the Adam optimizer and the DE algorithm results in improved classification performance. The simulation result analysis of the SDL-HBC technique is carried out on two benchmark datasets, namely the KTH and UCF Sports datasets. The experimental results show that the SDL-HBC technique achieves better recognition performance than recent state-of-the-art approaches. Therefore, the SDL-HBC technique can be considered an effective tool for intelligent video surveillance systems. As future work, the performance of the SDL-HBC technique can be boosted by the design of hybrid DL models.
Funding Statement: This research was funded by the Deanship of Scientific Research at the University of Business and Technology, Saudi Arabia.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.