Efficient Deep CNN Model for COVID-19 Classification

Coronavirus disease (COVID-19) first appeared in Wuhan, China, and was subsequently acknowledged as a global pandemic. The World Health Organization (WHO) stated that COVID-19 causes a 3.4% death rate. Chest X-Ray (CXR) and Computerized Tomography (CT) screening of infected persons are essential in diagnosis applications. There are numerous ways to identify positive COVID-19 cases, and one of the fundamental ways is radiology imaging through CXR or CT images. A comparison of CT and CXR scans revealed that CT scans are more effective in the diagnosis process due to their high quality. Hence, automated classification techniques are required to facilitate the diagnosis process. Deep Learning (DL) is an effective tool that can be utilized for the detection and classification of this type of medical image. Deep Convolutional Neural Networks (CNNs) can learn and extract essential features from different medical image datasets. In this paper, a CNN architecture for automated COVID-19 detection from CXR and CT images is presented. The best results are obtained using the ReLU activation function combined with the SGDM optimizer at a learning rate of 10⁻⁵ and a mini-batch size of 16.


Introduction
The COVID-19 epidemic, which appeared in Wuhan city in China, results in pneumonia, with fever and cough as the main indications of infection. A study performed on CT images to detect the infection proved that the detection rate from CT images is better than that from RT-PCR. Hence, chest CT scans were recommended [1][2][3][4][5].
Classification is an essential process in learning tasks, and it is a fundamental problem in the recognition area, which aims to classify medical images into several different categories. The classification of medical images includes two main steps. Firstly, the most helpful image features are extracted. Secondly, these features are used to build the models for dataset classification. Usually, specialists use their feature extraction experience to categorize medical images into different categories, making the classification sometimes tricky and time-consuming. Recently, DL has arisen due to its high quality and vast application domains in several research areas, especially for classifying medical images, since pre-processing or feature extraction is not required before training the model. A CNN is one of the latest progressions in the machine learning (ML) area, and it can be used for the analysis of medical images.
With the massive growth of neural networks and DL, finding an optimum model architecture for each application is necessary. Much work has been carried out to achieve the desired performance level and to obtain the best accuracy in any classification task. Activation layers such as Sigmoid, Tanh, and ReLU define the non-linearity of the neuron output [6,7]. A CNN comprises several layers ordered as the input layer, convolution layer, activation layer, fully-connected layer, classification layer, and output layer. Moreover, as machine learning algorithms are optimized, a significant improvement in their performance can be achieved. Therefore, finding a suitable activation function and optimizer is a basic task [6,7].
The objective of this work is to carry out comparisons between different activation functions and different optimizers for the classification of CXR and CT image datasets for COVID-19 detection. The CNNs have proved efficient performance in the classification of medical images. Therefore, this paper presents a CNN model for COVID-19 detection from CXR and CT images with a new training strategy. This strategy depends on the proper selection of the optimizer and the activation function. The rest of this paper is structured as follows. Section 2 summarizes the related work in this field. Section 3 gives short notes about the CNN. Section 4 describes the materials and methods used in the paper. Section 5 illustrates the proposed model architecture. Section 6 shows the experimental results and discussions. Section 7 provides the conclusions.

Related Work
The World Health Organization (WHO) has stated that COVID-19 rapidly spread in several countries worldwide. Early detection of COVID-19 cases can significantly control the spread of this virus. Much work has been performed on this topic due to its importance. This paper depends on DL to automatically detect COVID-19 from CXR and CT images. The performance of different classifiers is investigated to determine the optimum one [1][2][3]. The CXR and CT images can be used to detect COVID-19 cases. The CNN is one of the most popular and effective tools that identify COVID-19 from medical images [1][2][3]. Several review studies have been presented to highlight recent contributions to COVID-19 detection [8][9][10][11][12][13][14][15][16]. Several works used radiology images to identify and classify COVID-19 cases. Zheng et al. [13] proposed a DL model to classify pneumonia. Xu et al. [14] presented a model to classify pneumonia from CXR images based on compressed sensing (CS) with a deep transfer learning model. Sethy et al. [17] used the SVM classifier to classify the features acquired from several CNN models applied on CXR images. They achieved the best performance using the ResNet50 model with SVM.
Wang et al. [18] suggested a transfer learning model called COVID-Net to detect COVID-19 from CXR images. Their model achieved 92.4% accuracy for three classes: Normal, Non-COVID pneumonia, and COVID-19. Hemdan et al. [19] applied DL models to detect COVID-19 from CXR images and suggested a model called COVIDXNet. Their model achieved a 0.95 AUC value and a 0.96 sensitivity. Additionally, there is an online service to diagnose COVID-19 from CT images [20]. Wang et al. [21] used a CNN based on the Inception network model to identify COVID-19 cases from CT images. Ioannis et al. [22] proposed a DL model using 224 confirmed COVID-19 images. The authors of [23] proposed a model to classify COVID-19, influenza, and healthy CT image cases. Their model achieved an accuracy of 86.7%. In [24], the authors proposed a learning model to separate the main features in CT images in a pre-processing stage. Their model achieved accuracies of 89.5% and 79.3% with and without the pre-processing stage, respectively. Ozturk et al. [25] suggested a model that classifies CXR COVID-19 images. Their model has been applied to classify three main classes: COVID, No-COVID, and pneumonia, and achieved a classification accuracy of 87.02%. Alsharman et al. [26] used CNNs to classify CT COVID-19 images. They used a pretrained GoogleNet CNN architecture and achieved an accuracy of 82.14%. The growth of DL has a significant effect on the medical field due to the improved ability to classify medical images, and several image classification techniques can give radiologists a second opinion. The recent research works on medical image classification are summarized in Tab. 1.
In this paper, a DL model is presented to classify COVID-19 CXR and CT images. The proposed model has been trained from scratch without using any feature extraction approaches. It has been trained with 1000 CXR and 1000 CT medical images. One of the essential advantages of well-trained DL models is that they can extract features that are not apparent to the human eye. Hence, accurate classification can be performed.

Convolutional Neural Networks (CNNs)
Recently, DL has arisen due to its efficiency in a variety of application domains in several research areas, especially for classifying medical images, since pre-processing or feature extraction is not required before the training process. The CNN has gained significant importance and has been utilized in most state-of-the-art applications, including the detection and identification of diseases in different medical images. The main difference between a CNN and an ANN is that the CNN has a large number of hidden layers, so the CNN constitutes a deep architecture. It consists of several stacked layers ordered as input layer, convolution layer, pooling layers, activation layer, fully-connected layer, classification layer, and output layer.
The input layer enhances the image using pre-processing such as normalization and scaling. The convolution layer convolves the image with several suitably adjusted filters, and this convolution results in feature maps. Then, the pooling layers are used to minimize the dimensions of the generated feature maps. Pooling is carried out using a window with a proper stride, employing either max pooling or average pooling. In max pooling, the maximum value within the window is chosen, while in average pooling, the average value is estimated and used. The activation functions define the non-linearity of the model. Finally, the fully-connected layer feeds the output layer, which produces the classification result using the SoftMax classifier to determine the image class.

Activation Functions
The appropriate activation functions must be chosen carefully, because they significantly affect the neural network performance. The main purpose of activation functions is to provide non-linearity to their input. Three famous activation functions, namely, Sigmoid, Tanh, and ReLU, are used and studied in this work. They are summarized as follows:

• Logistic Curve (Sigmoid)
The Sigmoid function is defined as:

σ(x) = 1 / (1 + e^(−x))

It converts the input range from [−∞, +∞] to [0, 1]. The main disadvantages of the sigmoid are that it is computationally expensive and that it suffers from the vanishing gradient problem.

• Hyperbolic Tangent (Tanh)
The Tanh is a non-linear function that converts the range of the input to [−1, 1]. It is defined as:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

An advantage is that Tanh has steeper derivatives than the sigmoid function. On the other hand, it also suffers from the vanishing gradient problem.

• Rectified Linear Units (ReLU)
The ReLU is the most common and most widely used activation function. Using the ReLU function in a model makes it easier to train and often achieves better performance. The ReLU function is defined as:

f(x) = max(0, x)

The main advantage of the ReLU function is that it contains no exponential terms or divisions, which increases the computation speed. However, it overfits easily. The benefits and limitations of the activation functions examined through the simulation tests are summarized in Tab. 2.

Sigmoid
• The activation value does not vanish.
• Avoided when initializing a network with small random weights.
• Suffers from sharp damp gradients, slow convergence, and non-zero-centered output.
• Gradient updates are not in the same direction.

Tanh
• Its derivative is steeper than that of the sigmoid.
• More efficient due to its wider output range.
• Like the sigmoid, it suffers from the vanishing gradient problem.

ReLU
• No exponentials or divisions, resulting in increased computation speed.
• Sparsity in the hidden units, with output values between zero and maximum.
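Under the definitions above, the three activation functions can be sketched in plain Python (a minimal illustration for single inputs, not the paper's implementation):

```python
import math

def sigmoid(x):
    # squashes any real input into (0, 1); uses an exponential
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # squashes any real input into (-1, 1); steeper than sigmoid near 0
    return math.tanh(x)

def relu(x):
    # zero for negative inputs, identity otherwise; no exponentials or divisions
    return max(0.0, x)
```

Note how ReLU involves only a comparison, which is the source of its speed advantage discussed above.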

Optimizers

• Stochastic Gradient Descent with Momentum (SGDM)
Optimization of the model greatly contributes to minimizing the loss function. The SGDM is one of the most powerful and most commonly used optimizers. It is an improvement of the SGD optimizer. It depends on the current gradient and the past momentum to estimate the momentum in each dimension, and it accumulates the gradients of the past steps to determine the direction to go. The SGDM optimizer saves the update at each iteration and decides the following update as a function of the current gradient and the previous momentum update. This leads to:

v_t = α v_{t−1} + η ∇Q(w)
w = w − v_t

where w is the parameter that decreases the loss Q(w), η is the learning rate, and α is an exponential decay factor between 0 and 1 that controls the relative contributions of the current gradient and the previous momentum to the current update. Unlike the SGD optimizer, the SGDM optimizer tends to keep moving in the same direction to avoid oscillations.
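The SGDM update above can be sketched for a single scalar parameter (a toy illustration minimizing Q(w) = w², with assumed hyperparameter values; not the paper's training code):

```python
def sgdm_step(w, v, grad, lr=0.1, alpha=0.9):
    # v accumulates an exponentially decayed sum of past gradients (momentum)
    v = alpha * v + lr * grad
    # move the parameter against the accumulated direction
    return w - v, v

# minimize Q(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(300):
    w, v = sgdm_step(w, v, 2.0 * w)
# w is now very close to the minimum at 0
```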

• Root Mean Square Propagation (RMSprop)
Another optimizer is RMSprop, which adapts the learning rate using an exponentially decaying average of the squared gradients. Like momentum, RMSprop tries to decrease the oscillations, but in a different way. It automatically adjusts the learning rate by choosing a different rate for each parameter, and it depends on the past gradients to estimate that rate. The running average of the squared gradients is computed as:

E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²

where γ is the forgetting factor, and the updated parameters are given as:

w_{t+1} = w_t − (η / √(E[g²]_t + ε)) g_t

where ε is a small constant that avoids division by zero.
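The RMSprop update can be sketched in the same scalar setting (a toy illustration with assumed hyperparameter values, not the paper's training code):

```python
import math

def rmsprop_step(w, avg_sq, grad, lr=0.05, gamma=0.9, eps=1e-8):
    # running average of squared gradients with forgetting factor gamma
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad * grad
    # per-parameter step, normalized by the gradient magnitude history
    return w - lr * grad / math.sqrt(avg_sq + eps), avg_sq

# minimize Q(w) = w^2, whose gradient is 2w
w, s = 5.0, 0.0
for _ in range(500):
    w, s = rmsprop_step(w, s, 2.0 * w)
```

Because the step is normalized by the gradient history, the iterate settles into a small band around the minimum rather than converging exactly, which reflects the diminishing-learning-rate behavior noted in Tab. 3.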

• Adaptive Moment (Adam)
The Adam algorithm merges the properties of momentum and some of the benefits of RMSprop. The Adam optimizer determines an adaptive learning rate for each parameter. Like momentum, it retains an exponentially decaying average of past gradients m_t to reach a minimum faster, and, like RMSprop, it stores an exponentially decaying average of past squared gradients ν_t [27,28]. The decaying averages of past and past squared gradients m_t and ν_t are computed as:

m_t = β₁ m_{t−1} + (1 − β₁) g_t
ν_t = β₂ ν_{t−1} + (1 − β₂) g_t²

With the bias-corrected estimates m̂_t = m_t / (1 − β₁^t) and ν̂_t = ν_t / (1 − β₂^t), the Adam update rule is given by:

w_{t+1} = w_t − (η / (√ν̂_t + ε)) m̂_t

where ε is a small quantity (e.g., 10⁻⁸) utilized to avoid division by zero, and β₁ (e.g., 0.9) and β₂ (e.g., 0.999) are the forgetting factors for the first and second moments of the gradients, respectively. The benefits and limitations of the three optimizers are summarized in Tab. 3.

SGDM
• Fast, robust, and flexible.
• Faster convergence and reduced oscillations.
• Little required memory.
• One more variable is calculated for every update.

RMSprop
• An average of the squared gradients determines the diminishing learning rates.
• The magnitude of the previous gradients is employed to normalize the current gradient.
• The learning rate is updated automatically.
• The positive accumulation of squared gradients may reduce the learning rate significantly.

Adam
• Easy to realize.
• Small memory requirements.
• Appropriate for large-parameter problems and massive data.
• As in RMSprop, the parameters are adapted using the average of the second moment of the gradients rather than learning rates based on the first moment (the mean).
• The conflict in the optimization landscape is reduced.
• Low values of the second moment can be problematic.
• The update procedure may overshoot an ideal result as a result of a high divergence and learning rate.
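The Adam equations above can be sketched for a single scalar parameter (a toy illustration minimizing Q(w) = w² with the example β₁, β₂, ε values from the text; not the paper's training code):

```python
import math

def adam_step(w, m, v, grad, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # first moment: decaying average of gradients (momentum-like)
    m = b1 * m + (1.0 - b1) * grad
    # second moment: decaying average of squared gradients (RMSprop-like)
    v = b2 * v + (1.0 - b2) * grad * grad
    # bias correction compensates for the zero initialization of m and v
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# minimize Q(w) = w^2, whose gradient is 2w
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, 2.0 * w, t)
```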

Material and Methods
The motivation of this work is to offer a simple deep CNN architecture for classifying COVID-19 and Non-COVID-19 cases. This section describes all datasets used in this paper. In this study, simulation experiments are conducted on 1000 chest CXR and 1000 CT images of COVID-19 and Non-COVID-19 obtained from the open-source Mendeley datasets [29].
Each dataset is divided into a 70% training set and a 30% validation set. This partitioning of the training and testing data supports cross-validation, which checks whether the suggested classifier precisely classifies the normal vs. COVID-19 images or not. A sample of the employed datasets is shown in Fig. 1.
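The 70%/30% partitioning can be sketched as follows (a generic illustration assuming images are referenced by index; the seed and function name are illustrative, not from the paper):

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    # shuffle reproducibly, then cut at the 70% mark
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# 1000 images per modality, as in the Mendeley sets used here
train, val = split_dataset(list(range(1000)))
```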

Proposed Deep CNN Model
If we examine the performance of a CNN, it is evident that the network performance is enhanced with the increase in network depth. This comes at the cost of large memory requirements. The proposed deep learning model tries to make a trade-off between network size and network performance. The proposed CNN model is made up of 14 layers, as illustrated in Fig. 2. The input image size is 227 × 227 pixels, and it is fed into the first convolution layer, which has eight filters of size 3 × 3 and stride 1. The input image is zero-padded so that the output image size is the same as the input image size. The output is fed into the ReLU function, and finally, it is max-pooled with a window size of 2 × 2 and a stride of 2 to down-sample the image. These layers are followed by two similar structures. The first one depends on 16 filters of size 3 × 3 and stride 1, and the second depends on 32 filters of size 3 × 3 and stride 1. The last max-pooling layer is eliminated. A SoftMax classifier converts each class score into a probability distribution, and the cross-entropy is used as the loss function.
The first experiment compares the performance of different activation functions on the two datasets. The tested neural networks are trained for six epochs with batch sizes of 8, 16, and 32. In the first tested scenario, the analyzed neural networks are equipped with the SGDM, RMSprop, and Adam optimization techniques, all with a learning rate of 10⁻⁵. The learning rate is kept constant in the simulation tests, while the network structure and the activation functions are varied. The results obtained using the sigmoid function are shown in Tabs. 4-6 for the CXR dataset and Tabs. 13-15 for the CT dataset. The results of the Tanh function are shown in Tabs. 7-9 and Tabs. 16-18, respectively. Finally, the ReLU function results are shown in Tabs. 10-12 and Tabs. 19-21.
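The layer stack described above can be checked by tracing the feature-map sizes through the network (a plain-Python sketch of the stated dimensions, assuming a single-channel input; not the authors' implementation):

```python
def conv_same(h, w, filters):
    # 3x3 convolution, stride 1, zero-padded: spatial size unchanged
    return h, w, filters

def max_pool(h, w, c):
    # 2x2 window, stride 2: spatial size halved (floor division)
    return h // 2, w // 2, c

# input: 227 x 227 image
h, w, c = 227, 227, 1
h, w, c = conv_same(h, w, 8);  h, w, c = max_pool(h, w, c)   # 113 x 113 x 8
h, w, c = conv_same(h, w, 16); h, w, c = max_pool(h, w, c)   # 56 x 56 x 16
h, w, c = conv_same(h, w, 32)  # the last max-pooling layer is eliminated
```

The final 56 × 56 × 32 feature maps then feed the fully-connected and SoftMax stages.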
The second experiment compares the performance of the Adaptive Moment (Adam), Root Mean Square propagation (RMSprop), and Stochastic Gradient Descent with Momentum (SGDM) optimizers at a fixed learning rate of 10⁻⁵. All neural networks run on the two datasets, using the previously mentioned activation functions. Additionally, training for 1, 3, and 6 epochs allows assessment while averting duplicate accuracy values and avoiding overfitting.
• Performance of the Proposed Model on CXR Database
In this section, the effect of combining different optimizers with activation functions is studied and analyzed for improved COVID-19 detection. The third experiment scenario is thus a performance comparison of combinations of optimizers and activation functions. On the CXR dataset, the combination of SGDM with the ReLU activation function gives the best accuracy for a mini-batch size of 16 and a learning rate of 10⁻⁵. The training process and the confusion matrix are shown in Fig. 3. Therefore, the employed neural network with the combination of the SGDM optimizer and the ReLU function works better than the other combinations. Thus, the SGDM/ReLU configuration can find a lower local minimum within few epochs. Performing the same test on the CT dataset also proves that combining the SGDM optimizer with the ReLU activation function gives the best accuracy for a mini-batch size of 16 and a learning rate of 10⁻⁵. An accuracy of 91.67% is achieved on the CXR dataset (with 93.3% Precision, 93.1% Sensitivity, and 90.3% Specificity). It increases to 100% on the CT dataset (with 100% Precision, 100% Sensitivity, and 100% Specificity).
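The reported metrics can be recovered from a binary confusion matrix as follows (a generic sketch of the standard definitions; the function and variable names are illustrative, not from the paper):

```python
def metrics(tp, fp, fn, tn):
    # standard binary-classification measures from confusion-matrix counts
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, precision, sensitivity, specificity
```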
• Result Discussion
This paper concentrates on the benefits of using different activation functions and optimizers to build a model that can classify COVID-19 from CXR and CT medical images. The test findings reveal that the suggested deep CNN model is very effective and helpful in discovering and classifying COVID-19 cases. CT scans are recommended, because the best classification results are obtained on CT images. The CXR dataset can be increased in size for improved classification accuracy. The main advantage of the sigmoid function is that it is easy to implement on shallow networks. Its output value is in the range of 0 to 1 for inputs in the range of −∞ to +∞. Hence, the activation value does not vanish.
Conversely, the sigmoid function is not suitable when the neural network is initialized with small weights. The Tanh function outperforms the sigmoid function, as its steeper derivative leads to faster learning. However, similar to the sigmoid function, the Tanh function suffers from the vanishing gradient problem, and both functions activate the majority of the neurons in the same way. The ReLU function is preferred over the sigmoid and Tanh functions due to its generally increased computation speed, since it does not depend on exponentials or divisions. However, the ReLU function has the restriction that it overfits more easily than the sigmoid function. The SGDM optimizer can find a lower minimum without overshooting in fewer epochs. Unlike the SGD optimizer, the SGDM optimizer tends to move in one direction to avoid oscillations. The Adam optimizer determines adaptive learning rates for each parameter from the first and second moments of the gradients, and it also decreases the learning rates over time. The Adam optimizer can be viewed as a combination of momentum and RMSprop. It uses an exponentially decaying moving mean of the gradients in the update instead of the squared-gradient average alone as in RMSprop, and it is computationally efficient with small memory requirements.

Conclusions and Future Work
This paper revealed the benefits of using different activation functions and optimizers to build a model capable of identifying COVID-19 cases based on CXR and CT images. Three optimization algorithms, namely SGDM, RMSprop, and Adam, have been studied. These optimizers are often described as adaptive optimizers, because the learning step is modified according to the contour topology. Among the three algorithms, the SGDM is found to be the best. Simulation results revealed that all algorithms can converge to different local minima of the same loss. The Adam optimizer combines the best attributes of the momentum and RMSprop algorithms; it is relatively easy to configure, and it can handle sparse gradients. The simulation outcomes demonstrated that the proposed deep CNN approach is valuable and cost-effective in discovering COVID-19 cases. As future work, the findings can be enhanced by acquiring larger CXR and CT image datasets.