A Framework for Mask-Wearing Recognition in Complex Scenes for Different Face Sizes

People are required to wear masks in many countries, now a days with the Covid-19 pandemic. Automated mask detection is very crucial to help identify people who do not wear masks. Other important applications is for surveillance issues to be able to detect concealed faces that might be a safety threat. However, automated mask wearing detection might be difficult in complex scenes such as hospitals and shopping malls where many people are at present. In this paper, we present analysis of several detection techniques and their performances. We are facing different face sizes and orientation, therefore, we propose one technique to detect faces of different sizes and orientations. In this research, we propose a framework to incorporate two deep learning procedures to develop a technique for mask-wearing recognition especially in complex scenes and various resolution images. A regional convolutional neural network (R-CNN) is used to detect regions of faces, which is further enhanced by introducing a different size face detection even for smaller targets. We combined that by an algorithm that can detect faces even in low resolution images. We propose a mask-wearing detection algorithms in complex situations under different resolution and face sizes. We use a convolutional neural network (CNN) to detect the presence of the mask around the detected face. Experimental results prove our process enhances the precision and recall for the combined detection algorithm. The proposed technique achieves Precision of 94.5%, and is better than other techniques under comparison.


Introduction
Masks play an important role in defending people against Covid-19. Many countries forced laws to weak masks in public. Some people may not adhere to such protective laws or may forget to wear masks. Nevertheless, wearing a mask might lessen the risk of corona infection [1]. A report of the WHO confirmed over 120 million confirmed cases with more than 2 million death by February 2021. Many of these cases could be avoided by taking precaution steps such as wearing masks and social distancing. Respiratory drops from sneezing or talking loudly can transmit covid-19 as depicted in [2,3]. A proven way to reduce covid-19 spreading is to enforce mask wearing in public as advised by WHO [2].
In this paper, we studied the micro-graphs of different types of masks by considering the texture features of the images of those masks. Texture features extraction can be performed by fractal analysis methods, Fourier transform techniques, also, gray co-occurrence matrix. Authors at [4], presented a study that indicated that gray co-occurrence matrix (GLCM) can quantify the spatial association among adjacent pixels in a digital image. GLCM is utilized in texture detection [5], skin texture identification [6], defect recognition [7], fabric cataloging [8] and egg fertility detection [9]. Authors in [10], categorized fabrics utilizing Support Vector Machine with GLCM and Component Analysis. The experimental results of [11] show that the technique of combining GLCM and Back propagation deliver 93% accuracy. Since GLCM is efficient in texture analysis, we decided to utilize GLCM for the texture analysis of masks. Also, K Nearest Neighbor technique is simple and possess high performance in classification process [12], it is successfully used in text medical domain [13], ingestion patterns examination [14]. The authors in [15], concluded that k-nearest neighbor's algorithm (KNN) variant technique could identify the Coronavirus patients more accurately. Therefore our research utilizes the KNN algorithm to detect the time-wearing of masks based on the texture features mined from the micro-images at different period time.
CNN is a neuron machine model that uses convolution construction [16]. LeNet was one of the first CNN to be proposed. Rectified Linear machines gave a boast to CNN by year 2011. Object detection methodologies are established using Region-based Convolutional Neural Networks (R-CNN) [17]. R-CNN is an upgraded process based on CNN. Authors in [18], projected an epoch R-CNN system that was utilized mainly for object detection. It utilized search selective technique to identify borders from the rest of the image and normalize it to the CNN input size, followed by object identification by the SVM using linear regression model. The drawbacks were complicated training phase and slow testing speed.
An enhancement of the training phase and the testing speed of R-CNN was performed by the proposed Fast R-CNN. Fast R-CNN utilizes less layers with pooling layer. It also used the Softmax classifier instead of the SVM for better and faster classification [19]. However, the utilized selective search technique used in the fast R-CNN makes the computation speed not suitable for large data sets. Faster R-CNN introduces the combined feature extraction and bounding box regression to speed up the process. Faster R-CNN executes well for large objects (faces), but cannot identify small objects or faces [20]. We have to utilize other techniques for smaller objects that can optimizes face detection with respect to scale invariance and image resolution. Scale invariance is a central part of all current object detection methods.
Observing mask-wearing manually in large crowds can deem impossible and costly, especially in the Pilgrimage season. Reducing manual observation while making certain that all people are wearing masks all the time is an urgent problem. Image analysis tools can reduce the labor force and material costs, and can considerably protect people in many areas. Computer vision procedures and can be utilized for mask detection. Deep neural networks also are utilized greatly in object classification and recognition [21].
There are very few computer vision research that was done for mask detection and is usually utilized in very simple scenes but not from surveillance images. For this reason we will survey other object detection techniques that can help us in our research. For object detection, researchers utilized a histogram of gradient and a SVM machine to detect persons and a Hough transform to detect helmets and located faces by background subtraction. They also used background modeling and face classification method called C4 to detect faces, and detect whether masks using color transformation and recognition [22]. But it was not suitable for complex scenes and energetic backgrounds, such as busy streets and Pilgrimage season. Authors in [23], utilized a shot object detector process to detect faces with RetinaNet with a multiple features to overcome the limitation in accuracy. They also utilized the (YOLO) algorithm to detect helmet wearing in images with less than four people.
In this research, we propose a framework to incorporate two deep learning procedures to develop a technique for mask-wearing recognition especially in complex scenes and various resolution images. In this paper, we propose a mask-wearing detection algorithms in complex situations under different resolution and face sizes. Testing data are utilized for validation of our model. The paper is organized as follows, method is described in Section 2. Experiments are described and results are analyzed in Section 3. While conclusions are depicted in Section 4.

Method
The proposed technique is for mask-wearing detection purpose. The proposed model is comprised of two streams; the first one is to detect anchors for faces and the second stream is utilized to detect the mask-wearing objects. The block diagram of the proposed technique is depicted in Fig. 1. The face detection stream identify and compute the anchor boxes which contains the faces using sliding window. Mask-wearing detection can be addressed as object detection algorithm. Many object detection algorithms are established in the literature. Neural Networks (CNN), support vector machine (SVM) and naive Bayes classifier, are examples of such object detection algorithms. CNNs are found to be more suitable for scenes with complex surroundings that have many objects with different orientations. CNNs have superior accuracy and dependability for such cases. We choose to utilize convolution neural networks to resolve this matter. In this paper, we are using Region-Based CNN for face detection (R-CNN). R-CNN performs feature extraction via a several layers. Namely: Input, Convolution, relu and pooling layers. Region-Based CNN uses a large number of anchor boxes to the input image. 256 positive anchor boxes and 256 negative boxes will be randomly selected in the training phase. Softmax layer will utilize these anchor boxes to extract candidate regions and their bounding boxes. R-CNN scans images utilizing a sliding windows above the anchors. At the end it yields the anchors with the maximum probability of enclosing a face object as shown in Fig. 4.  For an input image Im, the ground-truth anchor are represented as G. G A represents the selected anchor boxes with Similarity > ɛ, where ɛ is a predetermined threshold that found to give better classification when it is equal to 0.75 as stated before. The symbol c A represents the presence of the mask confidence score, calculated by the R-CNN algorithm. The term A represents the used algorithm, W A is the weight of the CNN, and they are depicted in the following Eqs. (2)-(4) Therefore, the Loss function can be defined as follows: Lossð AðIm; W A Þ½0; AðIm; W A Þ½1; GÞ (3) To train the model we use the minimization function in Eq. (4).

Different Size Face Detection Algorithm (DS-Face)
A one-stage face detection method is proposed that incorporates the fusion of multi-scale features. We present an algorithm called DS-Face to facilitate the detection for different small-sized faces. The training data are utilized on multiple scales to enhance the prediction accuracy. For an input image Im, we use G D to represent the bounding boxes and c D to represent the confidence scores computed by the Different small-Size Face Detection algorithm. Where, D represents the Different Size-Face Detection algorithm and W D represents the weight, we can represent that in Eq. (5).

Low and High Resolution Mask-Wearing Detection: LHMD
For the face anchors G A and G D identified by the proposed algorithm, we perform a merging procedure using the following scheme:  c. If any anchor G n in G D possess more than 70% overlapping area with G A [i]:

Then
• Remove G n from G D • Remove the corresponding entries from Score p .
• Calculate the average (Avg) value of the removed entries in Score p • Insert Avg into c SC .

Else
Insert zero into c SC .

CNN for Mask-Wearing Detection
The three algorithms: R-CNN, DS-Face and LHMD are able to determine that the image encloses a face but cannot determine if a mask is present. Therefore, we are adding the usage of a CNN to identify maskwearing. For anchor boxes identified by the Different Size-Face Detection algorithm, we will use a CNN for mask-wearing detection. The face region is enlarged and used as an input to the CNN for prediction. The confidence score designates there is a mask-wearing on the identified face through the CNN, as shown in Eq. (6).
Score p ¼ Pð G D ; Im; W P Þ (6) P represents the forward propagation function of CNN. Where P denotes the composition of two convolutions followed by a fully connected (FC) layer. And the Loss function is presented in Eq. (7). minmize ðLossðG D ; PðG D ; Im; W P Þ; GÞÞ 3 Experimental Results A large dataset of mask-wearing faces is essential for the training phase of deep learning networks. Classification of mask-wearing faces and non-wearing masks. In [18], they build a public dataset of mask-wearing faces (CMFD), and non-wearing masks (IMFD) in the MaskedFace-Net. 137, 016 face images that include different sizes faces and are accessible at [18]. Half of them are mask-wearing faces and the other half of non-wearing mask faces.
The performance of our proposed method is evaluated using multiple criteria. We compute the true positive (TP), false positive FP, false negative FN, the recall and precision rates. Six-fold cross validation model is utilized for the experiments. The dataset is randomly partitioned into six partitions. The Training set contains half of the images (both faces with masks and without masks). The validation subset contains 2/6 and the testing subset contains 1/6 of the whole database.

Analysis
The experimental results of the R-CNN for identifying faces are depicted in Figs. 5 and 6. From the results, it is concluded that the R-CNN is fit to identify large faces, but not small faces. While the combined DS-Face algorithm combined with LHMD algorithm plus CNN can identify small faces with high and low resolution as shown in Fig. 6, while Fig. 7 depicts the results of combing the three algorithms DS-Face + LHMD + R-CNN.

Accuracy of the Mask Detection Algorithm
In this section, we are going to evaluate the mask detection Algorithm accuracy. We found that increasing the number of training steps in the model will avoid overfitting. We made different sets of experiments to observe the accuracy with various number of training steps. At the training phase, we labeled the face in the training dataset with wearing or not wearing mask labels. We utilized images from the dataset as training subset to assess the accuracy of the model. We used 10,000 steps in the training procedure and tested the images in the training subset, the accuracy was not satisfactory with missing most of the small faces. We can state that the precision of the model is evaluated by the number of detected faces (large and small). We repeated the experiment by increasing number of training steps to 20,000 steps. We found that the number of detected faces in 1500 images as testing subset is kept at about 1900 target. With increasing the number of training steps, the count of detected faces increased slightly. with 50,000 training steps, the detected faces increased to 2511 faces. We measured the precision and recall rate at 50,000 training step milestone and found them to be 88.3% and 86.9% respectively. By increasing the training steps to 75,000 steps, the number of detected faces reached around 3000 faces with precision rate of 93.2%, and recall of 91.3%. The results for different number of training steps is depicted in Tab. 1. We also display the ROC curve on the training image set in Figs. 9-11.  To measure the accuracy of the R-CNN we cropped the images to include only one target. We applied the cropping to 7,000 images from the training image set. We got over 30,000 images with only one face in it after the cropping phase. We utilized 25,000 images as training input for the R-CNN, and the 5,000 images for the validation phase to measure the accuracy of the R-CNN. The 30,000 cropped images are partitioned into people with and without masks.  We utilized the cross-validation method [24,25], and the CNN composed of six convolution layers followed by pooling layers. The first layer and the second convolution kernel layer are defined as [5,5], while the size of the third and fourth convolution kernel layers is [3,3]. With this configuration, the precision of the classifier at the last layer reached 90.3%. The ROC curve on the training set using the R-CNN alone is shown in Fig. 9.
The ROC curves of the R-CNN combined with DS-Face proves that such combination can enhance the accuracy of the classification result. The area under its ROC curve coverage increased to be 0.88, as shown in Fig. 10.
The ROC curve of the combined R-CNN and DS-Face has been increased from 0.83 to reach 0.88. The ROC curve covered of the model that combined R-CNN, DS-Face and the LHMD model has the largest area under the curve than the previous two models. This model is the best among the three models and has coverage area of 0.91.

Comparison of the Three Models
We performed a comparison between the three models using the metrics TPR, FPR, FNR, precision, and recall. The results are demonstrated in Tabs. 2 and 3. The tables summarize the same results showing the third model that combines the R-CNN and our propose algorithms DS-Face and LHMD works better in terms of all the metrics.

Conclusion
In this research, we propose a framework to incorporate two deep learning procedures to develop a technique for mask-wearing recognition especially in complex scenes and various resolution images. An R-CNN is used to detect regions of faces, which is further enhanced by introducing a different size face detection even for smaller targets. To increase the accuracy of our R-CNN we integrated two more algorithms. DS-Face to increase the accuracy of detecting different size faces especially small faces.  While LHMD algorithm is to increase accuracy for low resolution targets. The combined model improves the accuracy to a higher extent. Our experiments proved that the DS-Face with the mask classifier CNN overcome the inadequacies that the R-CNN model alone have when trying to detect small faces. The accuracy of a single deep learning model R-CNN combined with the CNN for mask detection did not meet perform well for mask detection because it misses a lot of small faces and low resolution targets. We reached a conclusion of combining two other algorithms to the R-CNN to detect more faces to achieve better results. We combined the base algorithm R-CNN with other algorithms for better mask detection algorithm to improve the detection accuracy. Our model can be faster and performs in real time by utilizing GPU and distributed computing for faster processing speed especially in complex scenarios.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.