A Study on Small Pest Detection Based on a Cascade R-CNN-Swin Model

Abstract: This study aims to detect and prevent greening disease in citrus trees using a deep neural network. Collecting data on citrus greening disease is very difficult because the vector pests are extremely small. Because the amount of data collected for deep learning is insufficient, we rely on the efficient feature extraction of a neural network based on the Transformer algorithm. Specifically, we use the Cascade Region-based Convolutional Neural Network (Cascade R-CNN) Swin model, which combines the Transformer model with the Cascade R-CNN model, to detect greening disease occurring in citrus. We try to improve model stability by establishing a linear relationship between samples using the Mixup and CutMix algorithms, which are image-processing-based data augmentation techniques. In addition, higher accuracy is obtained by using the ImageNet data set, transfer learning, and stochastic weight averaging (SWA). This study compared the Faster Region-based Convolutional Neural Network with a Residual Network-101 backbone (Faster R-CNN ResNet101), the Cascade R-CNN ResNet101 model, and the Cascade R-CNN Swin model. The Faster R-CNN ResNet101 model achieved an Average Precision (AP) at Intersection over Union (IoU) = 0.5 of 88.2%, an AP (IoU = 0.75) of 62.8%, and a Recall of 68.2%. The Cascade R-CNN ResNet101 model achieved an AP (IoU = 0.5) of 91.5%, an AP (IoU = 0.75) of 67.2%, and a Recall of 73.1%. In contrast, the Cascade R-CNN Swin model achieved an AP (IoU = 0.5) of 94.9%, an AP (IoU = 0.75) of 79.8%, and a Recall of 76.5%. Thus, the Cascade R-CNN Swin model showed the best results for detecting citrus greening disease.


Introduction
According to research results from the "World Agricultural Organization" on citrus greening disease, trees or leaves infected with greening disease should be promptly discarded. Tolerant trees and uninfected seedlings must be used, and transmission cannot be stopped unless all diseased seedlings are found and removed from the orchard. Through these measures and careful management, the losses of a citrus orchard can be reduced as much as possible [1]. Artificial intelligence can support this kind of orchard and pest management. Deep learning, a branch of artificial intelligence, is widely used in image recognition and classification research. In agricultural research, the convolutional neural network (CNN or ConvNet) [2] has shown excellent performance in classifying damage and diseases of crops such as pears, peaches, apples, grapes, and tomatoes [3].
Building on CNN models, researchers developed a detection system for citrus with greening disease [4]. Another researcher developed a system that can identify images of diseases for 26 plants, including citrus greening disease, with high overall detection accuracy [5]. Another researcher established a neural network model to detect citrus greening disease and proposed four classifications for 8 categories of abnormal symptoms; among them, the accuracy of detecting citrus greening disease was 93.7% [6].
The research direction of this paper is to find a method for the early detection and prevention of pests by examining greening disease of citrus trees using the Cascade Region-based Convolutional Neural Network (Cascade R-CNN) model [7]. The difficulty in identifying citrus greening disease is that, as shown in Fig. 1, the vector pest is too small to identify with a low-performance system, and it is very difficult to collect training data. The images used in this paper need to be enlarged to detect the very small pests accurately, and accuracy suffers because the amount of collected data is very small. To solve this problem, this paper uses a transformer model, which relies on a mechanism called 'Self-Attention' [8]. Self-Attention was created to overcome the limitations of the recurrent neural network (RNN) [9], which is slow because it is difficult to parallelize. We use the Cascade R-CNN model to detect citrus infected with greening disease.
The purpose of this study is to achieve better results in detecting small pest targets and obscured disease targets from a small amount of original sample data, using the Cascade R-CNN target detection model with a Swin Transformer feature extraction network.
This approach can effectively prevent the overfitting and false positives caused by a fixed intersection over union (IoU) threshold that is too high or too low. In addition, transfer learning based on the ImageNet data set and stochastic weight averaging (SWA) [10,11] are used to achieve higher accuracy.
In this study, we address the loss bounding problem of AdamW by optimizing the parameter update with stochastic weight averaging (SWA), which improves the stability of the model parameter update.
This paper is organized as follows: Section 2 describes the structure of the model proposed in this study; Section 3 describes the configuration of the whole system to be studied; Section 4 describes the experimental results; and finally, Section 5 discusses the conclusion of this paper.

Swin Transformer
Modeling in computer vision has long been dominated by convolutional neural networks (CNNs) [12]. However, the evolution of network architectures in natural language processing (NLP) has taken a different path, where the prevalent architecture today is instead the transformer [13]. Designed for sequence modeling and transduction tasks, the transformer is notable for its use of attention to model long-range dependencies in the data. Its tremendous success in the language domain has led researchers to investigate its adaptation to computer vision, where it has recently demonstrated promising results on certain tasks, specifically image classification [14] and joint vision-language modeling.
In 2021, researchers proposed a new vision transformer, called the Swin Transformer, which capably serves as a general-purpose backbone for computer vision. Challenges in adapting the transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, the researchers proposed a hierarchical transformer whose representation is computed with shifted windows [15].
The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection [16]. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.
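The windowing idea described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the Swin Transformer implementation: the 4 x 4 token grid, the window size of 2, and the helper names `window_partition` and `cyclic_shift` are assumptions chosen for readability.

```python
def window_partition(grid, window):
    """Split an H x W grid (list of rows) into non-overlapping window x window blocks."""
    height, width = len(grid), len(grid[0])
    assert height % window == 0 and width % window == 0
    blocks = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            blocks.append([row[left:left + window] for row in grid[top:top + window]])
    return blocks


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions with wrap-around."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]


# Toy 4 x 4 grid of token indices (assumed sizes, not the paper's settings).
tokens = [[r * 4 + c for c in range(4)] for r in range(4)]

# Regular windows: attention would be computed inside each 2 x 2 block only.
regular = window_partition(tokens, 2)

# Shifted windows: shift by half the window size, then partition again.
shifted = window_partition(cyclic_shift(tokens, 1), 2)
```

In this toy case the first regular window holds tokens 0, 1, 4, 5, while the first shifted window holds tokens 5, 6, 9, 10, which belonged to four different regular windows. Alternating the two partitions is what lets information cross window boundaries while each attention computation stays local.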

Cascade R-CNN
In object detection, an intersection over union (IoU) threshold is required to define positives and negatives [18]. An object detector trained with a low IoU threshold, e.g., 0.5, usually produces noisy detections. However, detection performance tends to degrade as the IoU threshold increases. Two main factors are responsible for this: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) an inference-time mismatch between the IoU for which the detector is optimal and the IoUs of the input hypotheses. A multi-stage object detection architecture, the Cascade Region-based Convolutional Neural Network (Cascade R-CNN), is proposed to address these problems. It consists of a sequence of detectors trained with increasing IoU thresholds, so that they are sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next, higher-quality detector [19].
The Cascade Region-based Convolutional Neural Network (Cascade R-CNN) is very similar to the Faster Region-based Convolutional Neural Network (Faster R-CNN) and is largely divided into two steps: the first locates the target and the second classifies it. In Fig. 3, the Faster R-CNN [20] classifier is on the left and the Cascade R-CNN classifier is on the right. In Faster R-CNN, pool is a pooling layer [21] for feature maps [22], FC1 is the fully connected layer, B0 is the bounding box of the candidate region, B1 is the bounding box predicted by the structure, and C1 is the final predicted classification result. In Cascade R-CNN, pool is likewise a pooling layer for the feature map. FC1, FC2, and FC3 are the fully connected layers, and B0, B1, and B2 are the bounding boxes of the candidate regions. B3 is the bounding box predicted by the structure, C1 and C2 are intermediate predicted classification results, and C3 is the final predicted classification result. Because the Cascade R-CNN classifier uses the cascade method, the output of each classifier is used as the input of the next, so each stage receives better data than the one before it. For this reason, the classifier has the advantage of higher effectiveness [23].
Cascade R-CNN uses three cascade detectors in the classification and regression stages, and by gradually increasing the IoU threshold, the candidate frame is continuously optimized and the detection result becomes more accurate. Additionally, it can effectively prevent overfitting and false positives caused by a fixed IoU threshold that is too high or too low [23].
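To make the effect of the IoU threshold concrete, the following minimal sketch counts how many candidate boxes qualify as positives at each cascade stage. The stage thresholds (0.5, 0.6, 0.7), the box coordinates, and the function names are illustrative assumptions and are not taken from this paper's configuration. The sketch also does not model the box refinement that each stage performs before passing its output on.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed per-stage thresholds for a three-stage cascade.
STAGE_THRESHOLDS = (0.5, 0.6, 0.7)


def positives_per_stage(proposals, ground_truth, thresholds=STAGE_THRESHOLDS):
    """Count proposals that would be labeled positive at each stage threshold."""
    return [sum(1 for p in proposals if iou(p, ground_truth) >= t) for t in thresholds]


# Hypothetical ground-truth box and proposals shifted progressively to the right.
gt = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10), (1, 0, 11, 10), (2, 0, 12, 10), (3, 0, 13, 10), (6, 0, 16, 10)]
counts = positives_per_stage(proposals, gt)
```

With these toy boxes the number of positives shrinks from 4 to 3 to 2 as the threshold rises, which illustrates why training a single detector directly at a high threshold leaves few positive samples. In the cascade, each stage first improves the boxes, so the later, stricter stages still receive enough good candidates.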

Citrus Greening Disease Detection System Using the Swin Transformer Model
The transformer used in this paper relies on a mechanism called 'Self-Attention'. The transformer's attention was created to overcome the limitations of the recurrent neural network (RNN), which is slow because it is difficult to parallelize. To translate a given word, the word has to be compared against all other words in the sentence. Unlike RNNs, transformers do not need to process data sequentially, which allows much more parallelism. By processing entire sentences in parallel and forming associations even between distant words, the transformer complements the RNN model and improves language comprehension in deep learning.
This paper detects greening disease targets and diseased seedlings in citrus orchards with the transformer model. High-definition cameras or drones are used to collect video from the citrus orchard. The collected video is converted into still images by frame extraction, and usable images are labeled to build the data set. Finally, leaves and pests related to citrus greening disease are also detected by other target detection models, and detection performance is compared across the Cascade R-CNN Swin model, the Cascade R-CNN-ResNet (Residual Network) 101 model, and the Faster R-CNN-ResNet101 model.
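As a rough illustration of the self-attention step described in this section, the sketch below computes scaled dot-product attention over a short list of token vectors in plain Python. It is a simplification under stated assumptions: the query, key, and value projections are taken to be the identity (a real transformer learns them), there is a single head, and the token values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections (assumed)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Compare this token against every token in the sequence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        outputs.append([sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)])
    return outputs


# Hypothetical three-token sequence with two features per token.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
```

Each output row depends on all input rows, and no row has to wait for another to be computed first. That independence between rows is what allows the parallel processing that an RNN cannot offer.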

Structure of the Proposed System
This section describes the architecture and components of the proposed system, which follows the design of the citrus greening disease detection network model.
The citrus greening disease detection system consists of image acquisition, image enhancement, real-time target detection, data warehouse storage, and web visualization. The overall process is shown in Fig. 4: (i) the image acquisition part uses a drone equipped with a high-definition camera to capture the citrus orchard and transmits the footage to the image processing part; (ii) the image enhancement part performs image data preprocessing and data enhancement; (iii) the real-time target detection part detects the leaves and pests of trees with citrus greening disease in the image; (iv) when diseased leaves or pests are detected, the results are visualized and a risk warning is issued; and (v) daily detection results and images are stored in the data warehouse.

Fig. 5 shows the training process for citrus greening disease detection based on Cascade R-CNN Swin.
First, we selected high-quality images from a citrus orchard and built a data set through labeling. The original data set contains only 725 images, so the model must be trained on very little data. To address this, data augmentation methods such as Random Flip, Random Rotate, and Grid Mask are used. Augmentation increases the amount of data by 250% compared to the original data set.
[...] are attached to each block, and the 1/1/3/1 blocks are actually repeated by grouping them as a set. The annotation H/4 × W/4 × C written on each stage denotes patch × patch × channel: 48 is the initial patch size times the number of input channels (4 × 4 × 3), and C is 96 in the base Swin Transformer model. The Swin Transformer block replaces the Multi-head Self-Attention (MSA) of the Vision Transformer with Window Multi-head Self-Attention (W-MSA) and Shifted Window Multi-head Self-Attention (SW-MSA). MSA is the standard self-attention of the transformer, but applying it to an image is very costly because every pixel must reference every other pixel in the image. W-MSA instead divides the image into 4 windows and performs self-attention within each window. SW-MSA is a shifted-window MSA that shifts the windows by half the window sizes wH and wW [23].
The Cascade R-CNN head is composed of three stages, and it is said that more stages would adversely affect performance. The cascade structure is applied not only during training; the cascade structure shown in Fig. 5b is also used for inference.
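The patch arithmetic and the cost difference between MSA and W-MSA mentioned above can be checked with a short calculation. The sketch below is hedged: the 224 × 224 input size and the window size of 7 are assumptions (common Swin Transformer defaults, not values stated in this paper), and the cost expressions are the operation counts usually quoted for global and windowed self-attention, 4hwC² + 2(hw)²C and 4hwC² + 2M²hwC, where M is the window side length.

```python
def patch_embedding_shape(height, width, patch=4, in_channels=3, embed_dim=96):
    """Return (tokens_high, tokens_wide, raw_patch_dim, embed_dim) after patch splitting."""
    raw_patch_dim = patch * patch * in_channels  # 4 * 4 * 3 = 48 values per patch
    return (height // patch, width // patch, raw_patch_dim, embed_dim)


def msa_cost(h, w, c):
    """Approximate operation count of global self-attention on an h x w token map."""
    return 4 * h * w * c * c + 2 * (h * w) ** 2 * c


def wmsa_cost(h, w, c, window):
    """Approximate operation count of window-based self-attention (window x window tokens)."""
    return 4 * h * w * c * c + 2 * window * window * h * w * c


# Assumed input resolution and window size, for illustration only.
shape = patch_embedding_shape(224, 224)
h, w, _, c = shape
global_cost = msa_cost(h, w, c)
window_cost = wmsa_cost(h, w, c, window=7)
speedup = global_cost / window_cost
```

Under these assumptions the first stage has a 56 × 56 token map, each token starts as 48 raw values, and it is projected to C = 96 channels. The global attention cost grows with the square of the number of tokens, whereas the windowed cost grows linearly, so doubling both image sides multiplies the windowed cost by four rather than sixteen.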

Development Environment
In this study, the experiments were implemented in Python 3.7, and the PyTorch-based MMDetection API was used as the artificial intelligence library. In both the training and test environments, the OS was Windows 10, the CPU was an i9-9900K, the RAM was 128 GB, and the GPU was an NVIDIA RTX 6000. The detailed development environment is shown in Tab. 1.

Image Enhancement
Histogram equalization generally increases the overall contrast of an image whose pixels occupy a narrow range of intensity values. By spreading the intensity values evenly, it flattens the histogram. As a result, areas of low contrast gain contrast relative to their surroundings, and good results are obtained for images that are too bright or too dark. The result of histogram equalization image enhancement is shown in Fig. 6.
Figure 6: Comparison of the differences before and after histogram equalization filter processing
Tab. 2 describes the performance results obtained by training the model for detecting greening-diseased citrus on the existing data set and on the histogram-equalized data set. When detecting citrus with greening disease using the existing data set, Average Precision (AP) (IoU = 0.5) was 82.2%, AP (IoU = 0.75) was 60.4%, and Recall was 68.2%. Using the histogram-equalized data set improved AP (IoU = 0.5), AP (IoU = 0.75), and Recall by 3.85%, 2.96%, and 3.32%, respectively. The experimental results show that image enhancement can increase the diversity of features and improve the accuracy of training results based on the original data. These results suggest that the performance of the neural network model is linearly and positively related to the number of training samples.
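For readers unfamiliar with the technique, a minimal histogram equalization for a single-channel image can be sketched as follows. This is not the preprocessing code used in the experiments: it assumes 8-bit gray levels stored in a flat list, the function name is hypothetical, and a real pipeline would normally call an image library routine and handle color channels.

```python
def equalize_histogram(pixels, levels=256):
    """Histogram-equalize a flat list of integer gray levels in [0, levels)."""
    if not pixels:
        return []
    # Count how often each gray level occurs.
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        return list(pixels)  # constant image: nothing to stretch
    # Map each level so that the cumulative distribution becomes roughly linear.
    scale = (levels - 1) / (total - cdf_min)
    lookup = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [lookup[value] for value in pixels]


# Hypothetical low-contrast patch whose values span only 100..103.
narrow = [100, 100, 101, 102, 102, 103]
stretched = equalize_histogram(narrow)
```

The toy patch, which originally used only four adjacent gray levels, is spread over the full 0 to 255 range while the ordering of the pixel intensities is preserved.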

Data Augmentation
The Random Flip, Random Rotate 90, and Grid Mask data augmentation methods are applied before training so that the network model is exposed to more varied characteristics of the data set when learning to detect greening-diseased citrus. In addition, the stability of the model is improved by establishing a linear relationship between samples with the Mixup algorithm [24].
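The linear relationship between samples that Mixup establishes can be illustrated with the short sketch below. It follows the common classification-style formulation, in which two images and their label vectors are blended with a weight drawn from a Beta distribution. The value alpha = 0.2, the flat-list image representation, and the function name are assumptions for illustration, and detection pipelines usually handle the bounding-box labels differently (for example by keeping the boxes of both images).

```python
import random


def mixup(image_a, label_a, image_b, label_b, alpha=0.2, rng=random):
    """Blend two samples: x = lam * x_a + (1 - lam) * x_b, and likewise for the labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed_image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label, lam


# Hypothetical two-pixel "images" with one-hot labels for two classes.
rng = random.Random(0)
image, label, lam = mixup([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], rng=rng)
```

Because the same weight is applied to the pixels and to the labels, the training target varies linearly between the two original samples. This is the linear relationship between samples referred to above.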

Results of the Detection of Citrus Infected with Greening Disease
The training loss of the Cascade R-CNN model used in this study to detect citrus with greening disease is reported in Tab. 4. Compared with the Cascade R-CNN model with the ResNet101 backbone, the Cascade R-CNN model with the Swin-Transformer backbone was higher by 3.4 percentage points in AP (IoU = 0.5) and by 12.6 percentage points in AP (IoU = 0.75). Recall was also higher by 3.4 percentage points.

Conclusion
This paper used the Cascade R-CNN model to detect greening disease in citrus fruits.
In addition, the Transformer-based Swin Transformer neural network was selected as the backbone of the model for feature extraction. This had the effect of expanding the receptive field of each network layer by extracting more detailed and multiscale features.
In terms of model research, the Faster R-CNN model was selected as a baseline for the Cascade R-CNN model, and the backbone of both algorithms was set to ResNet-101 for comparison. In addition, Cascade R-CNN with a Swin-Transformer backbone was compared with Cascade R-CNN with a ResNet101 backbone. The Cascade R-CNN model with the Swin-Transformer backbone showed higher performance in detecting greening disease than both the Faster R-CNN model and the Cascade R-CNN model with ResNet101 backbones. This demonstrates the superiority of the Cascade R-CNN model with the Swin-Transformer backbone proposed in this paper. Based on the results of this study, we intend to further develop models needed for agriculture.
Funding Statement: This research was supported by the Honam University Research Fund, 2021.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.