Utilization of Deep Learning-Based Crowd Analysis for Safety Surveillance and Spread Control of COVID-19 Pandemic

Crowd monitoring and analysis has become an important challenge in academic research, spanning surveillance equipment and the study of human behavior with different algorithms. Crowd counting schemes are typically processed in two steps: ground-truth density map creation, which produces a density map for each image, and density map estimation, in which a deep learning model learns to estimate that map. The COVID-19 pandemic changed our world within a few months and brought normal human life to a halt because of its rapid spread and high danger. Several precautions have therefore been adopted to slow the rate of new cases, such as maintaining social distancing through crowd estimation. This manuscript presents an efficient detection model for crowd counting and social distancing between visitors in the two holy mosques, Al Masjid Al Haram in Mecca and the Prophet's Mosque in Medina. It also develops a secure crowd monitoring structure based on a convolutional neural network (CNN) model using real image datasets of the two holy mosques. The proposed framework is divided into two procedures, crowd counting and crowd recognition, using datasets of different densities. To confirm the effectiveness of the proposed model, several metrics are employed for crowd analysis; they indicate that the model monitors crowds efficiently and with superior accuracy. The model is also highly adaptive to different crowd density levels and robust to scale changes across several places.


Introduction
In March 2020, the World Health Organization (WHO) reported that COVID-19 had spread to 114 countries around the world, with 118,000 active cases and 4,000 deaths [1]. With the number of daily infections increasing, the global community warned of the seriousness of COVID-19 and began to look for different ways to stop the pandemic. Many scientists and medical organizations are trying to develop effective vaccines or drugs, especially given the virus variants. Moreover, respiratory viruses such as COVID-19 are highly contagious and are transmitted through close contact with an infected person (through coughing and sneezing). WHO suggested social distancing between individuals to decrease the spread rate of COVID-19 by keeping a physical distance of two meters between any two persons in stores, big malls, parks, train stations, celebrations, championships and places of worship [2][3][4]. Safety management of public spaces has encouraged numerous researchers to analyze and develop schemes for crowd monitoring [5][6][7][8]. Consequently, associated tasks such as density estimation, counting, behavior detection, tracking and localization in crowd scenes have been studied [9][10][11]. The tasks that receive the most attention in the fight against COVID-19 are crowd counting and density estimation, on which this manuscript focuses. Since 1990, computer vision researchers have discussed virtual environments in many disciplines, with some limitations due to lack of information. With the continuous growth of population and digital information, intelligent environments have become a very challenging task for security and public safety management. Therefore, crowd monitoring and analysis with surveillance cameras has been studied in computer vision with respect to social and computational aspects [12][13][14].
In the early years, computer vision algorithms strongly supported video surveillance systems; however, identification and tracking degraded noticeably from one community to another because of crowd density. Various attributes are employed to analyze the behavior, density and count of a crowd scene [15,16], in three basic steps: i) pre-processing by segmentation, ii) tracking individuals and groups of objects, iii) recognition of events and behavior. Because this traditional analysis is time-consuming, scientists have applied artificial intelligence (AI) techniques to analyze and classify crowd scenes. AI allows fast and accurate real-time tracking and estimation of the number of people in crowded places. Recently, governments have used AI technology for many tasks such as surveillance of internet dating, advertising, recruiting, terrorism prevention and fraud detection. Deep learning (DL) is a subset of AI that uses multi-layered artificial neural networks (ANNs) to obtain high accuracy in tasks such as detection, recognition and classification. Most studies have verified the classification effectiveness of DL models such as the CNN and the recurrent neural network (RNN). CNNs and RNNs are applied to analyze crowd behavior and to classify digital images or videos [17][18][19][20][21][22]. The most important religious occasions for Muslims are Hajj and Umrah, which gather high-density crowds in the holy sites, as shown in Fig. 1. In 2019, the Saudi General Authority for Statistics recorded more than 2.4 million pilgrims during Hajj [23]. Tawaf, the circling of the Kaaba seven times, is one of the major pillars of both Hajj and Umrah and noticeably increases crowd density. An additional part of the Tawaf is kissing the Black Stone, which is considered a hard task.
This work is motivated by the rampant spread of COVID-19, for which crowds are the foremost danger and social distancing is a key countermeasure. This article designs a deep learning CNN model for monitoring social distancing using digital images and surveillance video in the two holy mosques, Al Masjid Al Haram in Mecca and the Prophet's Mosque in Medina. The proposed framework is divided into two processes, crowd counting and crowd recognition, using datasets of different densities. The crowd results are analyzed with evaluation metrics to validate the presented CNN model.
The rest of this article is organized as follows. Section 2 presents the related work. Section 3 introduces the outline methodology of the CNN. Section 4 explains the proposed crowd monitoring strategy. Section 5 presents and discusses the results of the proposed crowd monitoring strategy. Section 6 concludes the article.

Related Work
Crowd demographics are essential to understanding the risk a crowd poses in the presence of COVID-19. Various approaches have been proposed to deal with crowd counting problems and keep up with rapidly advancing trends. This section explores recent state-of-the-art methods for crowd estimation, starting from the first crowd analysis methods in [24], which discusses researchers' perspectives in the computer vision field. In [25], the authors survey computer vision methods for crowd estimation, tracking, density and event detection. The work is validated through different simulations. In [26], the three traditional crowd counting approaches, based on detection, regression and density, were introduced. The three approaches were analyzed through simulation to explain their features and principles. Video imagery for population profiling in public spaces is an effective tool for counting. Therefore, [27] compared previous crowd counting methods based on video imagery. In [28], a survey of crowd counting strategies in computer vision and their density estimation for visual surveillance was presented. In [29], a survey of crowd statistics and behavior is presented for identifying groups, with datasets of crowd activity video. In [30], numerous methods for single-image crowd counting and density estimation are explored. That work focused on recent AI approaches based on CNNs. In [31], a comprehensive survey of crowd analysis using current CNN models, together with the software most often used in research, is introduced. It reviewed the optimization methodologies of CNNs and their important aspects for crowd estimation. For Saudi public places, a special crowd algorithm using the Internet of Things (IoT) for estimating area density is presented in [32]. That research employed a dataset of 750 images collected from various videos of malls, airports and Makkah.
In Switzerland, 508 male soldiers aged 21 years were infected with COVID-19. In [33], infection cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in two spatially separated groups are followed before and after social distancing. The results showed that social distancing can slow the spread of SARS-CoV-2 and prevent COVID-19 infection. An innovative localization method that tracks the position of an object based on sensors is introduced in [34]. The smart device uses an AI technique to maintain social distancing, especially in COVID-19 environments. It raises an alert if someone comes within six feet of another person. The results demonstrated the high accuracy and efficiency of the designed device. In [35], a digital solution based on a deep learning technique is suggested that raises an alert when social distancing is violated or when the number of people allowed in a certain place is exceeded. With the help of the Pose-Net model, a video stream from a CCTV camera is used to detect people and keep track of the humans present in the stream. This technology saves time, allows fast analysis and helps CCTV cameras monitor every gathering place.
Figure 1: Examples of crowded scenes in the two holy mosques

CNN Methodology
A deep CNN is a sequence of layers specially designed to deal with visual data. Its architecture contains three types of layers: convolutional, pooling and fully connected (F.C) layers, as illustrated in Fig. 2. Every layer uses certain differentiable functions to transform one volume of activations into another. An input image passes through three rounds of layers that extract its features and reduce its size.
The convolutional (Conv.) layer contains a set of convolution kernels that change the input dimensions and compute different feature maps. The feature value at location $(i, j)$ in the $k$-th feature map of the $l$-th layer, $z^{l}_{i,j,k}$, is computed by multiplying the kernel weights with a small input region whose extent depends on the kernel size, and the result is passed to the next layer.
The Rectified Linear Unit (ReLU) layer applies an activation function that returns the input when it is positive and 0 in place of negative values. The activation function introduces nonlinearities into the CNN so that it can detect nonlinear features; the resulting nonlinear activation value is denoted $a^{l}_{i,j,k}$.
The pooling layer performs a down-sampling operation to reduce the spatial dimensions of its input. It is placed between two convolutional layers, and the most widely used type, max pooling, selects the maximum element within a window [36], where the window $R_{ij}$ is a local neighborhood around location $(i, j)$.
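As a reference sketch, the three operations described in this paragraph are commonly written in the following standard form. The symbols $\mathbf{w}^{l}_{k}$ (kernel weights), $b^{l}_{k}$ (bias), $\mathbf{x}^{l}_{i,j}$ (input patch centred at $(i, j)$) and $y^{l}_{i,j,k}$ (pooled output) are notation assumed here for illustration; only $z^{l}_{i,j,k}$, $a^{l}_{i,j,k}$ and $R_{ij}$ come from the text.

```latex
\begin{align}
z^{l}_{i,j,k} &= {\mathbf{w}^{l}_{k}}^{\top}\,\mathbf{x}^{l}_{i,j} + b^{l}_{k} \\
a^{l}_{i,j,k} &= \max\left(z^{l}_{i,j,k},\, 0\right) \\
y^{l}_{i,j,k} &= \max_{(m,n)\,\in\,R_{ij}} a^{l}_{m,n,k}
\end{align}
```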
The F.C layer, or dense layer, consists of a defined number of neurons and performs a linear operation in which every input is connected to every output by a weight. It is also responsible for the final decision, using an activation function such as soft-max or sigmoid to classify the output features.
Crowd counting can be a difficult problem because pedestrians are hard to detect and track and because crowd distributions vary. The goal of this work is to train a mapping that distinguishes among low-, medium- and high-level features by counting the crowd. The main steps for implementing the proposed algorithm can therefore be summarized as:
1. Generate the normalized crowd density map and the density count as a preprocessing step.
2. Develop effective extracted features that describe the crowd, based on the proposed CNN model, for better representation.
3. Train on additional data of different crowd densities to increase the model's ability to analyze other, untrained scenes.
4. Train the CNN on crowd scenes through a learning process with two learning objectives, crowd counts and classification.
The proposed CNN learns crowd-specific features, which are more robust than handcrafted features. Three methods are presented to implement the CNN training process, namely the proposed algorithm 1, the proposed algorithm 2 and the proposed algorithm 3. Their specifications are listed in Tab. 1.
These methods are proposed to show how changing the training parameter values and the number of training images affects the performance and efficiency of the classification process.

Crowd Density Map
The integral of the density map over a sub-region gives the number of objects inside it, which is useful for both counting and tracking. Most CNN methods construct low-resolution density maps because of their down-sampling operations. Therefore, the proposed crowd density map depends on local features or the probabilistic crowd observation, as illustrated in Fig. 3.
The normalizing process allows learning the mapping $F: X \to D$, where $X$ is the set of low-level features extracted from the training images and $D$ is the crowd density map of an image. The density map depends on spatial location, human body shape and the perspective distortion of images. A crowd scene has three visible characteristics: 1) the image of a person appears at different scales because of perspective distortion; 2) the shape of a person is closer to an ellipse than to a circle; 3) heads and shoulders are the main cue for counting persons in images. The normalized crowd density map used for training is built from these characteristics.
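To illustrate why summing a density map over a region yields a count, the following minimal sketch places one normalized kernel per annotated head. It is an illustration only: the isotropic Gaussian kernel, the fixed `sigma`, the 32 × 32 grid, the head coordinates and the function names are assumptions, not the formulation used in this manuscript, which also accounts for body shape and perspective.

```python
import math


def density_map(heads, height, width, sigma=2.0):
    """Return a height x width density map whose sum equals len(heads).

    heads: list of (row, col) head annotations (hypothetical input format).
    Each head contributes a Gaussian kernel normalized over the image grid,
    so every person adds exactly 1 to the integral of the map.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for (hr, hc) in heads:
        # Unnormalized Gaussian centred on the head position.
        kernel = [
            [math.exp(-((r - hr) ** 2 + (c - hc) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(sum(row) for row in kernel)
        # Normalize so the kernel sums to 1 even when clipped by the border.
        for r in range(height):
            for c in range(width):
                dmap[r][c] += kernel[r][c] / total
    return dmap


def count_in_region(dmap, top, left, bottom, right):
    """Sum the density map over rows [top, bottom) and columns [left, right)."""
    return sum(sum(row[left:right]) for row in dmap[top:bottom])


heads = [(5, 5), (6, 20), (25, 12)]  # three annotated people (made-up positions)
dmap = density_map(heads, height=32, width=32)
total_count = count_in_region(dmap, 0, 0, 32, 32)
top_half_count = count_in_region(dmap, 0, 0, 16, 32)
print(round(total_count, 3), round(top_half_count, 3))
```

Summing the whole map recovers the three annotated people, while summing only the top half recovers approximately the two people annotated there, which is the property that makes density maps usable for region-wise counting.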

Crowd CNN Proposed Model
The image patches cropped from the training images are the inputs of the crowd CNN model. So that pedestrians have similar proportions, the size of each patch at different locations is chosen according to the perspective value of its center pixel. The proposed crowd CNN model contains 11 layers: 5 convolutional layers, 5 pooling layers and one FC output layer with the soft-max function, as shown in Fig. 4.
Conv.1 has a 2 × 2 kernel with 8 filters and is followed by a max pooling layer with a 2 × 2 kernel. Conv.2 has a 2 × 2 kernel with 16 filters and is followed by a max pooling layer with a 2 × 2 kernel, and so on up to the last convolutional layer, Conv.5, which has a 2 × 2 kernel with 128 filters.
ReLU is the activation function applied after every convolutional layer. Tab. 2 shows the details of the proposed CNN network. Finally, the FC layer with a soft-max activation function connects the output to the classes, predicting one of three categories according to the density estimation: low-density, moderate-density and high-density crowds.
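The layer stack described in this section can be traced with a short shape and parameter count. This is a sketch under stated assumptions rather than the configuration in Tab. 2: the 128 × 128 × 3 input, "same" padding with stride 1 in the convolutions, stride 2 in the pooling layers, and the intermediate filter counts of 32 and 64 for Conv.3 and Conv.4 (inferred by doubling from 8 and 16 up to 128) are all hypothetical.

```python
def trace_crowd_cnn(input_hw=128, input_channels=3,
                    filters=(8, 16, 32, 64, 128), kernel=2, num_classes=3):
    """Trace output shapes and trainable parameters of a 5 x (Conv + MaxPool) + FC stack.

    Assumptions (not taken from the manuscript): square input, 'same' padding
    and stride 1 in every convolution, 2 x 2 max pooling with stride 2.
    """
    layers = []
    size, channels, total_params = input_hw, input_channels, 0
    for index, out_channels in enumerate(filters, start=1):
        # One kernel x kernel x channels filter plus one bias per output map.
        conv_params = (kernel * kernel * channels + 1) * out_channels
        total_params += conv_params
        channels = out_channels
        layers.append((f"Conv.{index} + ReLU", (size, size, channels), conv_params))
        # Max pooling halves the spatial size and has no trainable parameters.
        size //= 2
        layers.append((f"MaxPool.{index}", (size, size, channels), 0))
    # Fully connected soft-max layer over the flattened feature maps.
    flat = size * size * channels
    fc_params = (flat + 1) * num_classes
    total_params += fc_params
    layers.append(("FC + soft-max", (num_classes,), fc_params))
    return layers, total_params


layers, total_params = trace_crowd_cnn()
for name, shape, params in layers:
    print(f"{name:<16} output={shape} params={params}")
print("total trainable parameters:", total_params)
```

Under these assumptions the stack has 11 entries, matching the 5 convolutional, 5 pooling and one FC layer of the text, and the small 2 × 2 kernels keep the model compact.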

Materials Description
The proposed model is evaluated on a database of crowd scenes obtained from images of the Holy Kaaba and the Prophet's Mosque, according to the latest statistics during COVID-19. The dataset contains 480 images in total, as shown in Tab. 3: 160 low-density, 160 moderate-density and 160 high-density crowd images, used for the training and testing phases. The dataset is divided randomly into two groups: the first, 80% of the material, is used to train the CNN model, with training repeated 40 times, and the second, the remaining 20%, is used for testing. Fig. 5 illustrates a sample of the studied scenes showing low-, moderate- and high-density crowds.
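A random 80/20 split of a 480-image dataset can be sketched as follows. The file names, the fixed seed and the helper `split_dataset` are hypothetical; whether the original split was stratified per class is not stated, so a plain shuffle is assumed.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle items reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical (file name, label) pairs: 160 images per density class.
classes = ("low", "moderate", "high")
dataset = [(f"{label}_{i:03d}.jpg", label) for label in classes for i in range(160)]

train_set, test_set = split_dataset(dataset)
print(len(train_set), len(test_set))  # 384 96
```

With a plain shuffle the three classes are only approximately balanced within each subset; a per-class split would be needed to guarantee exactly 128 training and 32 test images per class.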

Simulated Results
The proposed deep crowd algorithm is intended as a useful aid for authorities controlling crowd surveillance in the Holy Mosque and the Prophet's Mosque, in order to reduce overcrowding during COVID-19. The proposed model provides accurate and efficient crowd density analysis and classification through trained CNN networks and a binary classifier. In this work, the dataset was split into 80% for training and 20% for testing with a pre-trained CNN model. Deep features were then extracted, fused and ranked based on the correlation values between features. The density map and count estimation of the input patch are computed before the classification step, which gives the CNN algorithm more local and detailed information and hence a better representation of crowd patches. The proposed crowd model classifies crowds as low, moderate or high density according to the state of crowding. It estimates the number of people per square meter and then indicates the precautions needed to deal with the situation, such as disinfection or ventilation procedures.
The outcomes of the proposed model are compared with the ground-truth dataset and evaluated with the following statistical measures: accuracy, sensitivity, specificity, F1-score, precision and the Matthews Correlation Coefficient (MCC).
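For reference, these measures can be computed per class from a confusion matrix in a one-vs-rest fashion. The sketch below uses the standard definitions; the function names and the example labels are hypothetical and are not the manuscript's data.

```python
import math


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def one_vs_rest_metrics(matrix, k):
    """Standard binary metrics for class k against all other classes."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(matrix[r][k] for r in range(n)) - tp
    tn = total - tp - fn - fp

    def ratio(num, den):
        # Return 0.0 for undefined ratios instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, total),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


labels = ["low", "moderate", "high"]
# Hypothetical predictions: one moderate scene is mistaken for a high-density scene.
y_true = ["low", "low", "moderate", "moderate", "high", "high"]
y_pred = ["low", "low", "moderate", "high", "high", "high"]

cm = confusion_matrix(y_true, y_pred, labels)
for k, label in enumerate(labels):
    print(label, one_vs_rest_metrics(cm, k))
```

A classifier with no errors yields 1 for every measure in every class, which is the pattern of a perfect score across accuracy, sensitivity, specificity, precision, F1-score and MCC.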
Because the proposed network is trained iteratively, the optimal number of epochs is found by plotting network accuracy against epochs. After the network has run for several epochs, its accuracy starts to saturate. Therefore, to achieve the desired performance, the network is verified through three different algorithms that use regularization techniques (data augmentation, training losses and number of epochs), and their values are tracked to avoid overfitting. The three proposed algorithms have been implemented under three different parameter conditions, as shown in Tab. 4.
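Saturation in an accuracy-versus-epoch curve can also be located programmatically. The following sketch uses a simple patience rule; the accuracy values, `min_delta`, `patience` and the function name are hypothetical, and this is not necessarily how the epoch counts in Tab. 4 were chosen.

```python
def saturation_epoch(accuracies, min_delta=0.005, patience=3):
    """Return the 1-based epoch of the last meaningful improvement.

    Scanning stops once `patience` consecutive epochs fail to beat the best
    accuracy seen so far by more than `min_delta`. If that never happens,
    the last improving epoch of the whole curve is returned (0 if empty).
    """
    best_acc, best_epoch, stale = float("-inf"), 0, 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc + min_delta:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch


# Hypothetical validation accuracy per epoch.
curve = [0.52, 0.66, 0.78, 0.86, 0.91, 0.93, 0.932, 0.931, 0.933, 0.932]
print(saturation_epoch(curve))
```

Training beyond the returned epoch adds cost without a measurable gain in accuracy on this made-up curve, which is the condition the accuracy plot is meant to reveal.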
Figs. 6 and 7 show the performance of the three proposed algorithms in terms of accuracy and cross-entropy loss for both the training and testing stages, run for up to 1280 iterations. The results make clear that the proposed algorithm 3 has a significant effect on classifying crowd density across the different datasets: the network extracts better features, with maximum accuracy and minimum loss. The proposed algorithm 3 is therefore recommended for high-accuracy crowd density classification with the implemented datasets. The classification performance of the proposed model in terms of accuracy, specificity, sensitivity, precision, F1-score and MCC for the three algorithms is summarized in Tab. 5. The results indicate that the proposed algorithm 3 achieved superior performance in classifying low-, moderate- and high-density crowds, with 100% overall accuracy, 100% specificity, 100% sensitivity, 100% precision, an F1-score of 1 and an MCC of 1.
However, the other two algorithms showed lower overall performance in all metrics (accuracy, specificity, sensitivity, precision and MCC) because they used fewer input images and fewer epochs. Thus, the performance of the proposed algorithm 3 improves as the number of input images and the number of epochs increase, as shown in Fig. 8. The confusion matrix is used to clarify the prediction ratio for each class, as shown in Fig. 9. It is based on the false positive rate and the true negative rate and assumes a balanced class distribution in the dataset, which improved the prediction ratio and enhanced model performance. The proposed algorithm 3 was found to have better and more consistent true positive and true negative values, so it can classify different crowd densities efficiently.
Figure 9: The confusion matrix of the classification process in the proposed crowd CNN for the three algorithms (a) Algorithm 1 (b) Algorithm 2 (c) Algorithm 3

Conclusions
In visual surveillance, crowd analysis is a crucial component of security applications. During COVID-19, computer vision has been applied to collect and analyze visual data from live network cameras worldwide in order to observe human activities and monitor social distancing in crowds, and so help prevent the spread of the epidemic. The proposed work analyzes the crowd density of visitors in the two holy mosques, Al Masjid Al Haram in Mecca and the Prophet's Mosque in Medina, so that appropriate measures can be taken to prevent the transmission of infection. To this end, this research designs a deep learning CNN model that classifies crowd scenes into low, moderate and high density based on the generated normalized density map. The proposed model works in two stages: generating the normalized crowd density map and density count, then extracting features and classifying the crowd scene. The proposed models are implemented with different parameter values and evaluated with metrics such as specificity, sensitivity, precision, MCC and F1-score. An extensive analysis showed improved performance and better efficiency of the proposed architecture in crowd density classification, with maximum accuracy and minimum loss values.