VISPNN: VGG-Inspired Stochastic Pooling Neural Network

Aim: Alcoholism is a disease in which a patient becomes dependent on or addicted to alcohol. This paper aims to design a novel artificial intelligence model that can recognize alcoholism more accurately. Methods: We propose the VGG-inspired stochastic pooling neural network (VISPNN) model based on three components: (i) a VGG-inspired mainstay network; (ii) the stochastic pooling technique, which aims to outperform traditional max pooling and average pooling; and (iii) an improved 20-way data augmentation (Gaussian noise, salt-and-pepper noise, speckle noise, Poisson noise, horizontal shear, vertical shear, rotation, Gamma correction, random translation, and scaling on both the raw image and its horizontally mirrored image). In addition, two networks (Net-I and Net-II) are proposed for ablation studies. Net-I is derived from VISPNN by replacing stochastic pooling with ordinary max pooling. Net-II removes the 20-way data augmentation. Results: Ten runs of 10-fold cross-validation show that our VISPNN model achieves a sensitivity of 97.98 ± 1.32%, a specificity of 97.80 ± 1.35%, a precision of 97.78 ± 1.35%, an accuracy of 97.89 ± 1.11%, an F1 score of 97.87 ± 1.12%, an MCC of 95.79 ± 2.22%, an FMI of 97.88 ± 1.12%, and an AUC of 0.9849. Conclusion: The performance of our VISPNN model is better than that of the two internal networks (Net-I and Net-II) and ten state-of-the-art alcoholism recognition methods.

Nevertheless, in the current clinical routine, the diagnosis of alcoholism is mainly based on manual observation of brain images, which is labor-intensive and onerous. In particular, the slight shrinkage of the alcoholic brain in the early prodromal stage [8], associated with mild symptoms, is easily overlooked by radiologists and clinicians, which may incur costs to the patient and his/her family. In light of the above limitations, accurate and fast diagnostic artificial intelligence (AI) models to recognize alcoholism are beneficial to patients, families, and society.
In the past, various AI models have been proposed to recognize alcoholism. Fig. 1 shows the relationship among AI, machine learning (ML), and deep learning (DL): ML is a subfield of AI, while DL is a subfield of ML. Hou [9] introduced a novel algorithm, predator-prey adaptive-inertia chaotic particle swarm optimization (PACPSO), and applied it to identify alcoholism. Jenitta et al. [10] presented a local mesh vector co-occurrence pattern (LMVCoP) feature for assisting diagnosis, which can be used for alcoholism identification. Han [11] proposed a three-segmented encoded Jaya (3SJ) method to identify alcoholism, finding that 3SJ outperformed other optimization methods, such as the multi-objective genetic algorithm, plain Jaya, bee colony optimization, and particle swarm optimization. Lima [12] presented a method utilizing the Haar wavelet transform (HWT) to extract features from brain scans of patients; it achieved an accuracy of 81.57 ± 2.18% on their dataset. Afterward, Macdonald [13] presented a wavelet energy logistic regression (WELR) model and used 5-fold stratified cross-validation to verify its performance. Qian [14] proposed a computer vision-based technique utilizing cat swarm optimization (CSO), which mimics the behaviour of cats; in their experiments, CSO was demonstrated to perform better than four other bio-inspired algorithms. Chen [15] presented a model combining a support vector machine (SVM) with a genetic algorithm (GA), abbreviated SVMGA; the authors stated their model was effective in alcoholism detection, showing an average accuracy of 88.68 ± 0.30%. Chen [16] presented an AI model based on a linear regression classifier (LRC) for alcoholism detection.
Recently, DL techniques have been successfully applied to alcoholism recognition. Lv [17] created a 7-layer convolutional neural network (CNN); their experiments showed that stochastic pooling (SP) provided better performance than other pooling methods. Nevertheless, their CNN structure was simple, so its expressive ability was limited. Xie [18] used an AlexNet transfer learning (ANTL) model; the authors fine-tuned their model and tested five different replacement configurations.
There are other ML methods based on different data sources. For example, Kamarajan et al. [19] used random forest with electroencephalogram (EEG) source functional connectivity, neuropsychological functioning, and impulsivity measures to classify alcohol use disorder. Quaglieri et al. [20] harnessed functional MRI (fMRI) to analyze the brain networks underlying executive functions in gambling and alcohol use disorder. Many other scanning modalities and protocols may help identify alcoholism; however, we focus on MRI in this study due to its high-resolution three-dimensional imaging ability. The motivation of this paper is to propose a novel model, the VGG-inspired stochastic pooling neural network (VISPNN), for alcoholism recognition, with the expectation of obtaining better performance than existing alcoholism identification approaches. The contributions of our study cover the following four aspects.
(a) A VGG-inspired network is used as the mainstay network. (b) Stochastic pooling is used to replace traditional max pooling. (c) An improved multiple-way data augmentation is proposed to avoid overfitting. (d) Our model is proven to deliver better performance than state-of-the-art methods.

Introduction of VGG
Tab. 1 displays the list of abbreviations used in this study for ease of reading. First, we introduce VGG, which stands for Visual Geometry Group, an academic group at the University of Oxford. The VGG team presented two renowned networks, VGG-16 [21] and VGG-19, both included as library packages in prevalent programming platforms, e.g., Python and MATLAB.
Each conv block (CB) is denoted by a quadruple (a, b, c, d), which means "a repetitions of b kernels with sizes of c × c, followed by one max-pooling layer with a size of d × d." Note that (i) the activation function, i.e., rectified linear unit (ReLU) layers, are skipped in the subsequent text by default; (ii) stride and padding are not reported because they can be calculated easily. The five CBs are itemized in Tab. 3, and the feature map (FM) of the output is displayed in the final column. After five CBs, the FM of size 7 × 7 × 512 is flattened to a vector with 25,088 neurons. Three FCLs with 4096, 4096, and 1000 neurons are appended at the end.

Improvement I: Stochastic Pooling
Within standard CNNs, pooling is an essential component following a convolution layer (see Layer 5 in Tab. 2) to reduce the size of FMs. Traditional pooling methods are either max pooling (MP) or average pooling (AP). The block-wise processing is illustrated in Fig. 3.

Figure 3: Diagram of block-wise processing
The strided convolution (SC) traverses the input activation map with strides equal to the block size $(Q_1, Q_2)$. The $l_2$-norm pooling (L2P), average pooling (AP), and max pooling (MP) generate the $l_2$-norm value, the average value, and the maximum value within the block $B_{m_1,m_2}$, respectively. Nevertheless, AP outputs the average, downscaling the greatest values, where the essential features may lie. In contrast, MP stores the most significant value but aggravates the overfitting problem.
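For a block $B_{m_1,m_2}$ of size $Q_1 \times Q_2$ with activations $a(x, y)$, the three traditional pooling rules can be written in standard form (our own notation, consistent with the description above):

```latex
\mathrm{AP}\left(B_{m_1,m_2}\right)=\frac{1}{Q_1 Q_2}\sum_{(x,y)\in B_{m_1,m_2}} a(x,y),\qquad
\mathrm{MP}\left(B_{m_1,m_2}\right)=\max_{(x,y)\in B_{m_1,m_2}} a(x,y),\qquad
\mathrm{L2P}\left(B_{m_1,m_2}\right)=\sqrt{\sum_{(x,y)\in B_{m_1,m_2}} a(x,y)^{2}}.
```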
Alternatively, stochastic pooling (SP) provides a way to overcome the defects of AP and MP. Successful applications include SP in a stochastic resonance model [22], COVID-19 recognition [23], etc. SP is a four-step procedure.
Step 1 generates the probability map (PM) for each entry in the block $B_{m_1,m_2}$:
$$S(x,y)=\frac{a(x,y)}{\sum_{(u,v)\in B_{m_1,m_2}} a(u,v)},$$
where $S(x,y)$ stands for the PM value at pixel $(x,y)$ and $a(x,y)$ is the activation there.
Step 2 creates a random location vector (RLV) $r=(x_r,y_r)$ that follows the discrete probability distribution (DPD)
$$Z[r=(x,y)]=S(x,y),$$
where $Z$ represents the probability.
Step 3 draws a sampled location vector $r_0=(x_{r_0},y_{r_0})$ from the RLV $r$.
Step 4 outputs the activation at the sampled location, i.e., $a(x_{r_0},y_{r_0})$.
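The four steps above can be sketched for a single block with NumPy; the function name and the 2 × 2 example block are our own illustration:

```python
import numpy as np

def stochastic_pool(block, rng):
    """Stochastic pooling over one block.

    Step 1: probability map S = activation / sum of activations in the block.
    Steps 2-3: sample one location from that discrete distribution.
    Step 4: return the activation at the sampled location.
    """
    a = np.maximum(block, 0.0)         # post-ReLU activations are non-negative
    total = a.sum()
    if total == 0.0:                   # degenerate all-zero block
        return 0.0
    probs = (a / total).ravel()        # PM values S(x, y)
    idx = rng.choice(a.size, p=probs)  # sampled location vector r0
    return float(block.ravel()[idx])

rng = np.random.default_rng(0)
block = np.array([[1.0, 3.0],
                  [0.0, 4.0]])
# The output is always one of the block's own activations,
# larger ones being more likely to be picked
print(stochastic_pool(block, rng))
```

Unlike MP, every non-zero activation has a chance of being selected, which acts as a regularizer during training.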

Improvement 2: VGG-Inspired Stochastic Pooling Neural Network
A novel VGG-inspired mainstay network is proposed. Tab. 4 shows the structure of the proposed 10-layer VGG-inspired mainstay network. The definition of (a, b, c, d) can be found in Tabs. 2 and 3. The variables (e, f) represent the weights and biases of an FCL, respectively. The NWL column in Tab. 4 represents the number of weighted layers; the total number of weighted layers in this VGG-inspired mainstay network is 1 + 2 + 3 + 2 + 1 + 1 = 10. We can observe that after four CBs, the output FM has a size of 11 × 11 × 64, which is flattened to a vector of 7744 neurons, sent through an FCL with 100 neurons, and finally reduced to two output neurons indicating alcoholism or healthy.
The structure of our VGG-inspired mainstay network is displayed in Fig. 5a. If we replace the max-pooling in each CB with stochastic pooling, we can get the proposed VGG-Inspired Stochastic Pooling Neural Network (VISPNN), as shown in Fig. 5b.
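As a quick sanity check of the sizes stated above (a small illustrative script; the layer dimensions come from the text, while the parameter counts are our own arithmetic, not reported figures):

```python
# Sanity check of the feature-map sizes in Tab. 4: after four CBs the
# 11 x 11 x 64 FM is flattened, passed through a 100-neuron FCL, and
# reduced to two output neurons (alcoholism vs. healthy).
fm_h, fm_w, fm_c = 11, 11, 64
flat = fm_h * fm_w * fm_c
print(flat)  # 7744, matching the vector length reported in the text

# Illustrative (e, f) counts -- weights plus biases -- for the two FCLs
fc1 = flat * 100 + 100   # 100-neuron FCL
fc2 = 100 * 2 + 2        # 2-neuron output layer
print(fc1 + fc2)
```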

Improvement 3: 20-Way Data Augmentation
The dataset in this study was reported in Ref. [18] and comprises 188 alcoholic brain images and 191 non-alcoholic brain images. Fig. 6 shows two samples from our dataset.
The relatively small dataset may breed overfitting. To avoid this, data augmentation (DA) is a powerful tool because it can generate fake images on the training set. Cheng [23] presented a 16-way DA, in which 8 DA techniques were applied to both q(k) and q^(h)(k); this multiple-way DA showed better performance than traditional DA. This study builds on the 16-way DA of Cheng [23]; furthermore, we add two new DA methods, applied to both q(k) and q^(h)(k). One is speckle noise (SN), which alters the image as q_SN = q + q · N_SN, where N_SN is uniformly distributed random noise; the mean and variance of N_SN are set to 0 and 0.05, respectively. The other is Poisson noise (PN). In the electronics field, PN originates from the discrete nature of the electric charge. Instead of adding artificial noise to the raw image, we generate PN from the raw image. The pixel values of raw images are stored in uint8 format; if a pixel has the value of 20, then the corresponding pixel q_PN of the PN-altered image is generated from a Poisson distribution with a mean of 20. Mathematically, q_t(x, y) ~ Po[q(x, y)] and q_PN(x, y) = min[q_t(x, y), 255], where Po(λ) represents a Poisson distribution with a mean of λ, (x, y) are the coordinates, and q_t is a temporary variable. The min function ensures the final output lies within [0, 255]. A colourful natural image can be used to observe how those two noises alter the image, as shown in Fig. 7.
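The two added noise DAs can be sketched as follows. The multiplicative speckle formulation (q + q · N, matching MATLAB's imnoise 'speckle' convention) and the function names are our assumptions; the Poisson step follows the description in the text:

```python
import numpy as np

def speckle_noise(img, var=0.05, rng=None):
    """Speckle noise: q_SN = q + q * N_SN, with N_SN uniform,
    mean 0 and variance `var`. `img` is a float image in [0, 1]."""
    rng = rng or np.random.default_rng()
    half_width = np.sqrt(3 * var)                 # uniform on [-a, a] has variance a^2 / 3
    n = rng.uniform(-half_width, half_width, img.shape)
    return np.clip(img + img * n, 0.0, 1.0)

def poisson_noise(img_uint8, rng=None):
    """Poisson noise: each output pixel is drawn from Po(raw pixel value),
    then clipped to the uint8 range."""
    rng = rng or np.random.default_rng()
    q_t = rng.poisson(img_uint8.astype(np.float64))   # temporary variable q_t
    return np.minimum(q_t, 255).astype(np.uint8)      # min keeps output in [0, 255]
```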
Let $M_2$ stand for the number of new images generated by each DA method. First, the $M_1$ DA methods are performed on the raw image $q(k)$, generating $M_1$ datasets $K_m[q(k)]$, $m=1,\dots,M_1$, each containing $M_2$ images. Second, the horizontally mirrored image is generated as $q^{(h)}(k)=y_a[q(k)]$, where $y_a$ stands for the horizontal mirror function.
Third, all $M_1$ DA methods are performed on the mirrored image $q^{(h)}(k)$, generating $M_1$ new datasets $K_m[q^{(h)}(k)]$, $m=1,\dots,M_1$. Fourth, the raw image $q(k)$, the mirrored image $q^{(h)}(k)$, all the $M_1$-way DA results of the raw image $K_m[q(k)]$, and all the $M_1$-way DA results of the horizontally mirrored image $K_m[q^{(h)}(k)]$ are combined. The final generated dataset from $q(k)$ is defined as
$$G(k)=y_b\left\{q(k),\,q^{(h)}(k),\,K_1[q(k)],\dots,K_{M_1}[q(k)],\,K_1[q^{(h)}(k)],\dots,K_{M_1}[q^{(h)}(k)]\right\},$$
where $y_b$ stands for the concatenation function. Suppose the augmentation factor $M_3$ stands for the number of images in $G(k)$; we obtain $M_3 = 2 \times M_1 \times M_2 + 2$. Algorithm 1 summarizes the pseudocode of the proposed 20-way DA method. In this study, we set $M_1 = 10$, i.e., a 20-way DA. We also set $M_2 = 30$, thus $M_3 = 602$, indicating each raw training image will generate 602 images, including the raw image itself.
Step 1 Import the raw preprocessed k-th training image q(k).
Step 2 The M_1 geometric, photometric, or noise-injection DA transforms K_m are applied to q(k).
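The augmentation factor implied by the pipeline above can be checked with a short script:

```python
# Augmentation factor: M1 DA methods x M2 images each, applied to both the
# raw image and its mirror, plus those two source images themselves.
M1, M2 = 10, 30
M3 = 2 * M1 * M2 + 2
print(M3)  # 602 images generated per raw training image
```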

Implementation
R-fold cross-validation is employed. The whole dataset is divided into R folds [24]. At the r-th trial, 1 ≤ r ≤ R, the r-th fold is picked as the test set, and the remaining R − 1 folds [1, . . . , r − 1, r + 1, . . . , R] are chosen as the training set (Fig. 9). We let R = 10, namely 10-fold cross-validation. Furthermore, we run the 10-fold cross-validation ten times.
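A minimal sketch of the R-fold index splitting; the round-robin fold assignment and the sample count are our own illustration (real runs typically shuffle first):

```python
# Toy sketch of R-fold cross-validation index splitting (R = 10).
def r_fold_splits(n, R=10):
    folds = [list(range(r, n, R)) for r in range(R)]          # round-robin assignment
    for r in range(R):
        test = folds[r]                                       # r-th fold: test set
        train = [i for j in range(R) if j != r for i in folds[j]]
        yield train, test

n = 379  # 188 alcoholic + 191 non-alcoholic images
for train, test in r_fold_splits(n):
    assert len(train) + len(test) == n                        # folds partition the data
```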

Measures
Seven measures are used based on the confusion matrix of 10 runs of 10-fold cross-validation. Let I stand for the confusion matrix, where I(1, 1) means true positive (TP), I(1, 2) false negative (FN), I(2, 1) false positive (FP), and I(2, 2) true negative (TN). Sensitivity, specificity, precision, and accuracy are familiar to readers, so we do not repeat their definitions. Besides, we use the F1 score, Matthews correlation coefficient (MCC), and Fowlkes-Mallows index (FMI).
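Given the confusion matrix entries, the three less-common measures can be computed from their standard definitions (the helper name is our own):

```python
import math

def f1_mcc_fmi(tp, fn, fp, tn):
    """F1, MCC, and FMI from confusion matrix I:
    I(1,1)=TP, I(1,2)=FN, I(2,1)=FP, I(2,2)=TN."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # i.e., sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    fmi = math.sqrt(precision * recall)            # geometric mean of P and R
    return f1, mcc, fmi

# A perfect classifier scores 1.0 on all three measures
print(f1_mcc_fmi(tp=50, fn=0, fp=0, tn=50))
```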
The receiver operating characteristic (ROC) curve provides a graphical plot for evaluating AI models. First, the ROC curve is produced by plotting the true positive rate against the false positive rate at various threshold levels. Then, the area under the curve (AUC) is calculated from the ROC curve.

20-Way DA Results
Fig. 10 shows the M_1-way DA results of the raw image, chosen as Fig. 6a. Due to the page limit, we do not display the horizontally mirrored image and its corresponding M_1-way DA results.

Statistical Results of Proposed Method
Tab. 5 itemizes the statistical results (10 runs of 10-fold cross-validation) of the proposed VISPNN method. The mean and standard deviation (MSD) over the ten runs are displayed in the last row. Our model reaches a sensitivity of 97.98 ± 1.32%, a specificity of 97.80 ± 1.35%, and an accuracy of 97.89 ± 1.11%.

Ablation Studies
An ablation study is a procedural experiment that removes a network's submodule to understand that submodule better. Two ablation studies are carried out. (i) Net-I: we remove stochastic pooling from the proposed VISPNN model and use max pooling to replace the removed layers. (ii) Net-II: we remove the multiple-way data augmentation. The comparison of our VISPNN model with Net-I and Net-II is shown in Tab. 6. Fig. 11 displays the ROC curve comparison of the proposed VISPNN model with Net-I and Net-II; the blue patches correspond to the lower and upper confidence bounds. The AUC of Net-I is 0.9683, compared to 0.9849 for VISPNN; therefore, we can observe that stochastic pooling indeed increases performance. Meanwhile, the AUC of Net-II is 0.9602, a significant drop from VISPNN (0.9849). This drop reflects that multiple-way data augmentation can significantly increase prediction performance due to its ability to generate diverse "fake" training images.

Comparison to Other Alcoholism Recognition Methods
The proposed VISPNN model is compared with ten state-of-the-art alcoholism recognition methods: PACPSO [9], LMVCoP [10], WRE [11], HWT [12], WELR [13], CSO [14], SVMGA [15], LRC [16], CNNSP [17], and ANTL [18]. The comparison results are itemized in Tab. 7, with the cognate bar plot shown in Fig. 12, which ranks all the methods in order of MCC. We can observe from Fig. 12 that the proposed VISPNN model beats all ten state-of-the-art methods in terms of all seven measures. The reason is threefold. First, the VGG-inspired mainstay network benefits from mimicking structures similar to VGG-16. Second, stochastic pooling makes our model more robust than max pooling does. Third, the improved 20-way data augmentation generates diverse fake training images, making our model more resistant to overfitting.

Conclusions
To identify alcoholism more efficiently, we propose the VISPNN model based on a VGG-inspired mainstay network, the stochastic pooling technique, and an improved 20-way data augmentation. The results show that our model gains a sensitivity of 97.98 ± 1.32%, a specificity of 97.80 ± 1.35%, an accuracy of 97.89 ± 1.11%, and an AUC of 0.9849. The performance is better than that of ten state-of-the-art alcoholism recognition methods.
The limitations of this study are that the model has not gone through strict clinical verification and that the dataset is relatively small. Hence, we will try to collect more brain images of both alcoholic and healthy subjects. Meanwhile, we shall deploy our VISPNN model to a cloud server and invite clinicians and radiologists to use our web app, gathering their feedback to improve our model further.