Plant Disease Classification Using Deep Bilinear CNN

Plant diseases have become a major threat to farming and food supply. They impair the natural growth of plants, and infected plants are a leading cause of crop production loss. Manual detection and identification of plant diseases requires careful, observant examination by experts. To overcome manual testing procedures, automated identification and detection can be employed, which provides faster, more scalable and more precise solutions. The contributions of this work are threefold. Firstly, a bilinear convolution neural network (Bi-CNN) for plant leaf disease identification and classification is proposed. Secondly, we fine-tune VGG and pruned ResNets, utilize them as feature extractors, and connect them to fully connected dense networks. The hyperparameters are tuned to reach faster convergence and obtain better generalization during stochastic optimization of the Bi-CNNs. Finally, the proposed model is designed for scalability by deploying the Bi-CNN model in a real-world application and releasing it as open source. The models are built with test splits ranging from 10% to 50% of the data and are evaluated on gold-standard classification measures. When the testing samples were expanded 5x (i.e., from 10% to 50%), the deviation in accuracy was quite low (0.27%), which indicates consistent generalization ability. The larger model obtained an accuracy of 94.98% for 38 distinct classes.


Introduction
Agriculture is the basis of crop production and livelihood. One of the major risk factors in crop production is plant disease. Every crop is susceptible to some plant disease, which is an obstacle to healthy crop production and tops the list of reasons for crop loss. If a crop is diseased, the symptoms can be noticed by keen observation of different parts of the leaves. Plant diseases are categorized into pests, weeds and plant pathogens. The estimated average annual loss due to pathogens and pests is nearly 13%-22% for the world's major crops such as rice, wheat, maize and potatoes. For decades, farmers have identified these diseases by observing the leaves with the naked eye. However, this requires the farmer to be extremely skilled, or requires the guidance of an agricultural scientist, and the process consumes a lot of time.
One of the major reasons for production loss is diseases such as bacterial spot, early blight, late blight, and leaf mould, which occur frequently on the leaves of the tomato plant at different stages of its growth. Potatoes are one of the most dominant food crops, and their yield is reduced by the diseases Phytophthora infestans (late blight) and Alternaria solani (early blight). The average global yield loss due to pathogens and pests is around 17.2% in potatoes. Apple is one of the most produced fruits because of its nutritional and medicinal importance, and the severity of diseases on apple leaves (Mosaic, Rust, Brown spot, and Alternaria leaf spot) has led to huge production and economic losses and has also affected the quality of the produce. Early detection of these conditions in unhealthy plants allows us to take precautionary measures and limit crop losses.
In this research, we focus on developing an automatic, accurate and inexpensive Restful-API, integrated into a mobile app, to detect and classify various kinds of leaves using Bilinear Convolution Neural Networks (Bi-CNNs). The contributions to the body of knowledge are as follows:
1. We propose a bilinear convolution neural network (Bi-CNN) for plant disease identification and classification with leaf images as input.
2. We fine-tune VGG and pruned ResNets, utilize them as feature extractors, and connect them to fully connected dense networks. The hyperparameters are tuned to reach faster convergence and obtain better generalization during stochastic optimization of the Bi-CNNs.
3. The proposed model is designed for scalability by deploying the Bi-CNN model in a real-world application and releasing it as open source.
The product is explained in detail in the last section.

Motivation
The design of the proposed architecture is motivated by the two distinct cortical pathways of the human brain, which process object vision and spatial vision separately. The occipitoparietal pathway, the dorsal stream, helps locate the targeted object visually, whereas the occipitotemporal pathway, the ventral stream, extracts visual representations of objects, i.e., identifies them [1]. Both cortical pathways receive information about an object from the retinal input to the striate cortex. The occipitotemporal pathway interconnects the striate and pre-striate cortex with the inferior temporal regions, which eventually helps to identify the visual stimulus by understanding the physical properties of the targeted object.
This two-pathway view motivates the Bi-CNN design for classifying distinct plant leaves by their infection classes. The mathematical study is furnished in the methodology section.

Previous Research
Siddharth et al. [3] developed a model for the identification and classification of diseases in plant leaf images. The proposed model is based on a Radial Basis Function Neural Network with bacterial foraging optimization (BRBFNN), which increases the training speed of the network. The bacterial foraging algorithm searches for common attributes by grouping seed points to identify the features. Their classification results are based on the validation evaluation partition coefficient (Vpc) and the validation evaluation partition entropy (Vpe). They also compared their model with traditional machine learning methods such as K-means and SVM. The specificity of the proposed model for segmentation is 0.558, and the classification scores on Vpc and Vpe are 0.8621 and 0.1118 respectively.
Aydin et al. [4] experimented with various transfer learning models to analyze classification performance. They used four models and analyzed their performance on publicly available datasets. Additionally, they proposed a CNN model combined with LDA to classify the deep features extracted from pre-trained networks (AlexNet and VGG16). To analyze the performance of the model, a five-fold cross-validation procedure was adopted and the input image size was 100 × 100. The proposed model obtained an accuracy of 96.93%, whereas the pre-trained VGG16 model outperformed it with an accuracy of 99.80%. Karthik et al. [5] studied tomato leaf diseases and proposed two CNN architecture variants. With the help of Residual Progressive Feature Extraction, the model extracted spatial features with 0.6M parameters. With five-fold cross-validation, the model obtained an accuracy of 98%.
Mohanty et al. [6] experimented on the PlantVillage dataset using AlexNet and GoogleNet. Across the image variants, GoogleNet obtained an accuracy of 98% on grayscale, 99.34% on colour, and 99.25% on segmented images, with 5 different splits on 20% test samples. Uday Pratap et al. [7] proposed a Multilayer Convolutional Neural Network (MCNN) for the classification of mango leaves infected with anthracnose disease. The dataset consists of 1070 real-world images of mango leaves captured at a university. The proposed model was validated on 20% test samples and compared with various machine learning techniques. The MCNN model obtained an accuracy of 97.13%. Sandeep Kumar et al. [8] applied a new optimization technique called Exponential Spider Monkey Optimization (ESMO) for plant disease identification. SPAM is applied for feature extraction to classify whether a leaf is healthy or diseased. Several machine learning models were used to evaluate the performance, and it was concluded that SVM outperformed the other models with an accuracy of 92.1%.
Qiao Kang et al. [9] proposed a network that uses the ResNet50 architecture as a backbone model for plant disease diagnosis and severity estimation. The proposed network uses shuffle units as an auxiliary structure, which increases the performance of the model. The dataset was collected from the AI Challenger Global AI Contest and consists of 7 different plant species with an image size of 256 × 256 × 3. The proposed method obtained a classification accuracy of 98% and a recognition accuracy of 99%.
Ahmed et al. [10] introduced a model called CaffeNet, built on the Caffe framework, to label paddy pests and paddy diseases. This work used a database with 9 paddy pest classes and 4 paddy disease classes. The model was fine-tuned over 30,000 iterations and obtained an accuracy of 87%. Islam et al. [11] utilized machine learning algorithms and image preprocessing techniques to segment and identify potato diseases from the PlantVillage database. It was observed that SVM had an accuracy of 95% (over 300 images).

Dataset Description
The complete data was collected from the open repository [12]. The dataset is publicly available for research and can therefore be used for the classification and identification of distinct plant diseases, which can be delivered through a user-friendly mobile application. The experimentation is carried out in three parts. The first experiment, named D1, was carried out on 18 classes: 9 kinds of fruit leaves, each with a healthy and an unhealthy class. Note that each unhealthy class constructed in D1 contains various disease types such as early blight, black rot, bacterial spot etc. That is, all the infected kinds of a plant leaf are treated as a single unhealthy class of that particular leaf kind.
Next, the second experiment (D2) considered three kinds of plant leaf images with their multiple infection classes. Lastly, the third experiment (D3) used the complete 38 classes. The train and test descriptions for D1, D2, and D3 are given in Tab. 1. The complete partitions of the dataset into D1, D2, and D3 are provided in the repository.

Feature Extraction Models
VGG Models. Two VGG models, VGG16 and VGG19, were considered for extracting features while developing the Bi-CNN model. VGG16 and VGG19 contain consecutive convolution and pooling layers with a depth of 16 and 19 layers respectively. These models are used with pre-trained weights. The VGG16 and VGG19 models have 138.3 and 143.6 million parameters, with an input shape of 224 × 224 and three color channels. While developing the Bi-CNN model, it is assumed that one of the VGG models captures spatial invariances and the other captures the location of an entity residing in the image [13].
ResNet Models. ResNet models are also considered for feature extraction. Two flavors of ResNet are used: one is 50 layers deep and the other is 101 layers deep. These ResNet models can capture the invariances of the input by overcoming the problem of degradation. The deep residual connections help to mitigate vanishing gradients and regulate learning in deeper layers. These ResNets are not only computationally cheap but are also much deeper and capable of capturing invariances with greater performance. Hence, the two flavors are used after pruning the network appropriately [14].
The bottom layers with larger feature maps, i.e., 2048 activations, are excluded. During experimentation, it was observed that these final activations led to high computation. Although the original model performed better than the pruned model, the ResNet model was therefore pruned by truncating the last four layers. Finally, the VGG and ResNet models are fine-tuned with a regulated training procedure to reach faster convergence with greater generalization.

Bi-Linear Convolution Neural Network (Bi-CNNs)
As mentioned, Bi-CNNs are motivated by the visual perception of the human brain through two cortical visual pathways. This motivation led to the design of a neural network that extracts the spatial location of the entity residing in an image and captures its structural invariances. To extract features, the bottleneck activations of the pre-trained networks are utilized. The two sets of features, for extracting the spatial location and the morphology of the input, are denoted Net-X and Net-Y [15].
In our methodology, we utilize not only pre-trained VGG models (as in [15]) but also ResNet intermediate activations obtained by cautious architecture pruning. To classify D1, only VGG models, i.e., VGG16 and VGG19, are utilized. In the case of D2, both pruned ResNet and VGG models are utilized. For D3, only pruned ResNet is employed.
Which can be simplified as, where represents the element-wise product of the two given feature vectors.
For clarity, the bottleneck activations extracted from the feature-extracting models, fv_x and fv_y, are pooled by second-order pooling: an outer product is applied to the two sets of features, and the result is linearized to form the feature vector fv_z. This feature vector contains the fine-grained features and the outer-product is appropriately described below. Next, normalization is adopted to regularize the model. It is processed in three steps, i.e., three normalization layers are attached sequentially. The first normalization layer is chosen as either the natural logarithm or the square root of the individual features extracted from fv_z. After the first normalization, the features are passed through a signed square-root as the second normalization step. The mathematical formulation for the signed square-root is described below. Thirdly, the feature vectors are regularized through an l2-normalization constraint. When the non-linearities are not regulated properly, the maximum probability mass for each sample is assigned as a logit, i.e., activated via a sigmoid. These assignments cause fragile activations with a high chance of saddle points.
This scenario can be prevented by choosing l2-normalization as the final activation layer. Hence, these three normalization layers can produce effective outcomes by transforming the features into a regulated latent space.
Both normalizations, i.e., square-root and logarithm, are evaluated, but they are not applied at the same time. Next, these latent representations, which capture information about the whole image, have to be classified appropriately. Tsung-Yu et al. [15] used SVD and LYAP methods for computing the matrices. Those are not end-to-end trained neural architectures and do not run end-to-end on the GPU; they only classify the extracted feature vectors on the CPU. For efficient end-to-end training, a fully connected neural network is chosen with successive dropout and batch normalization layers. The major difference from Tsung-Yu et al. [15] is that they used only VGG as the pre-trained network and did not train end-to-end via backpropagation. In this research, a better feature extractor, i.e., ResNet, is used with end-to-end training.
Note that the feature-extraction method is the same for all the models mentioned in Tab. 1, but the number of final-layer activations varies from 3 to 38 depending on the model. In Fig. 1, N at the final softmax activation denotes this number, which is chosen based on the specific model to be fine-tuned. The feature extraction part is fine-tuned and the classification part is fully trained. Throughout this procedure, the gradient flow in the network is analyzed.
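The pooling and normalization steps described above can be illustrated with a minimal, standard-library sketch. It assumes the usual definitions: the outer product of fv_x and fv_y flattened into fv_z, a signed square-root of the form sign(v)·sqrt(|v|), and division by the Euclidean norm for l2-normalization. The function names are hypothetical, the first normalization layer (logarithm or square root) is left out because its exact form is not given here, and real bottleneck activations would be tensors rather than short lists.

```python
import math


def bilinear_pool(fv_x, fv_y):
    """Outer product of two feature vectors, flattened into one vector (fv_z)."""
    return [a * b for a in fv_x for b in fv_y]


def signed_sqrt(features):
    """Signed square-root: sign(v) * sqrt(|v|) for every feature v."""
    return [math.copysign(math.sqrt(abs(v)), v) for v in features]


def l2_normalize(features, eps=1e-12):
    """Scale the vector to unit Euclidean length (eps guards against a zero norm)."""
    norm = math.sqrt(sum(v * v for v in features))
    return [v / (norm + eps) for v in features]


# Toy activations standing in for the two feature extractors (assumed values).
fv_x = [1.0, 2.0]
fv_y = [3.0, -4.0]
fv_z = l2_normalize(signed_sqrt(bilinear_pool(fv_x, fv_y)))
```

The signed square-root keeps the sign of each pooled feature while compressing large magnitudes, and the final l2 step places every sample on the unit sphere before classification.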

Architecture Parameters
This section describes the hyperparameter tuning and optimization of the model during the training of the Bi-CNNs. The classification architecture is built from fully connected layers with the pattern 256-128-N, where N is the number of classes used to discriminate the input. Firstly, the input is activated via ReLU as the non-linearity. Further, a batch normalization layer is applied to reduce the problem of covariate shift [16]. Next, a dropout layer is added as a regularization method, which reduces overfitting [17]. The dropout ratio is set to 30% (chosen during optimization). The weights are initialized with glorot-normal [18].
Several models are trained. The D1 models are trained using negative log-likelihood as the cost function (Loss1), whereas for both D2 and D3 the squared hinge (Loss2) is used as the cost function for stochastic optimization of the neural network. In these losses, i indexes the instances (feature samples) and k indexes the class labels; y_k^(i) is the ground-truth label of class k for the i-th instance, and ŷ_k^(i) is the predicted class label. To optimize the model, adam [19] is chosen as the optimizer with an initial learning rate of 0.0005. The training schedules are motivated by the work of Samuel L et al. [20], where the noise during training is reduced either by increasing the batch size or by decaying the learning rate (keeping the momentum constant). The noise during training is understood mathematically as $\text{noise} \propto \dfrac{D_{train} \cdot \text{rate}}{\text{batch} \cdot (1 - \text{momentum})}$. The aim is to decrease the noise during training by tweaking either the learning rate schedule or the batch size. So, the learning rate is slowly increased together with the batch size to obtain faster convergence with greater generalization ability. As the training samples are large, the regularization provided by l2 is useful and the learning rates are scheduled accordingly [21]. To provide appropriate learning with faster convergence and proper generalization, a definite training schedule is used, where for every single iteration (≈ 4 epochs) the batch size and learning rate are updated cautiously.
At the first iteration, the batch size is initialized as 10 (batch_0 ← 10) and the learning rate is initialized as 0.5 × 10^-3 (rate_0 ← 0.5 × 10^-3). Next, in the second iteration the learning rate is increased to 0.001 and the batch size is doubled to 20. Finally, in the third iteration the learning rate is held constant and the batch size is set to 25. This iterative schedule for stochastic optimization of the neural network is used for D1, D2, and D3. It helped to reach faster convergence with cautious hyperparameter tuning [22].
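As an illustration of the two cost functions named above, the following sketch uses their standard textbook definitions. This is an assumption, since the exact formulas are not reproduced here: negative log-likelihood is computed over one-hot labels and predicted probabilities, and squared hinge over labels encoded as -1/+1 and raw scores, averaged over classes and instances. The function names are hypothetical.

```python
import math


def negative_log_likelihood(y_true, y_pred, eps=1e-12):
    """Mean over instances of -sum_k y_k * log(p_k); y_true rows are one-hot."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(truth, pred))
    return total / len(y_true)


def squared_hinge(y_true, y_pred):
    """Mean of max(0, 1 - y_k * s_k)^2; y_true rows use -1/+1 labels (assumed)."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        per_class = [max(0.0, 1.0 - t * s) ** 2 for t, s in zip(truth, pred)]
        total += sum(per_class) / len(per_class)
    return total / len(y_true)


# Toy example with one instance and two classes (assumed values).
loss1 = negative_log_likelihood([[0, 1]], [[0.25, 0.75]])
loss2 = squared_hinge([[-1, 1]], [[-0.5, 0.5]])
```

The squared hinge penalizes any score that falls inside the margin quadratically, which makes it smoother than the plain hinge near the margin boundary.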
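The staged schedule and the noise relation from [20] can be sketched as follows. The batch sizes and learning rates of the first two stages are taken from the text; the third-stage batch size of 25, the training-set size, and the momentum value are assumptions made only for illustration, and `noise_scale` is a hypothetical helper that returns the proportional quantity rather than an absolute noise level.

```python
def noise_scale(n_train, learning_rate, batch_size, momentum):
    """Quantity proportional to SGD noise: (N * rate) / (batch * (1 - momentum))."""
    return (n_train * learning_rate) / (batch_size * (1.0 - momentum))


# One entry per training iteration (each roughly 4 epochs in the text).
SCHEDULE = [
    {"stage": 1, "batch_size": 10, "learning_rate": 0.0005},
    {"stage": 2, "batch_size": 20, "learning_rate": 0.001},
    {"stage": 3, "batch_size": 25, "learning_rate": 0.001},  # batch size assumed
]

N_TRAIN = 10_000  # hypothetical training-set size
MOMENTUM = 0.9    # hypothetical momentum value, held constant

for stage in SCHEDULE:
    scale = noise_scale(
        N_TRAIN, stage["learning_rate"], stage["batch_size"], MOMENTUM
    )
    print(f"stage {stage['stage']}: relative noise {scale:.3f}")
```

Because the relation depends only on the ratio of learning rate to batch size, doubling both between the first and second stage leaves the noise term unchanged, while enlarging the batch at a fixed learning rate lowers it.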

Results
To determine the performance of the Bi-CNNs, classification metrics are chosen based on their significance. Accuracy, the area under the curve (AUC) of the receiver operating characteristic (ROC) [23], and mean-squared error (MSE) are used. Accuracy is chosen as the gold-standard metric for classification performance: it is the number of correctly classified instances divided by the total number of instances, both correctly and incorrectly classified.
MSE, a metric commonly used for regression models, measures the deviation of the predicted instances from the ground truth. This deviation can also help determine the performance of the model when ground-truth labels are provided. Finally, AUC is calculated from the curve obtained by plotting the true positive rate on the y-axis against the false positive rate on the x-axis. The performance evaluation is carried out by developing the model variants listed in Tab. 1. The results are divided into two sections, which are explained below.
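The accuracy and MSE definitions above translate directly into a few lines of code. This is a minimal sketch with hypothetical function names and toy inputs; it is not the evaluation code used for the reported results.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared deviation between ground truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels and scores (assumed values).
acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
mse = mean_squared_error([1.0, 0.0], [0.5, 0.5])
```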

Evaluated on D1
For D1, two models are produced: one applies the square root as the first normalization layer and the other applies the logarithm. Each of the models mentioned in Tab. 1 is evaluated with various metrics. During the evaluation, it is observed that the Bi-CNN model with square-root as the first normalization layer outperformed the logarithm variant on all of the generalization splits.
To check whether the proposed model attends to the required regions, heat maps are generated. These heat maps are also known as class activation maps (CAMs). Heat maps visually describe the final-layer activations of the model and color the highly activated regions (i.e., the region of interest). To understand the activations from the bottleneck layer fv_z, heat maps are plotted for three different plant leaf kinds (apple, corn, peach) with their infection classes and are shown in Figs. 2 and 3. TensorFlow is used as the backend to implement the proposed methodology [22].
ROC curves have the useful property of being insensitive to changes in the class distribution and can provide good relative instance scores. To handle multiple classes, the ROC is calculated by treating one chosen class as positive and the remaining classes as negative. So, 18 different AUC-ROC curves are generated. These curves are generated for the Bi-CNN (sqrt) model, as its performance was the best, for all the generalization splits and are visualized in Fig. 4. Further, AUC-ROC curves [23] are generated for each individual class. AUC is used as a metric in Tab. 2 because it provides detailed characteristics of the classifier.
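The one-vs-rest procedure described above can be sketched with the standard library. The sketch relies on the known equivalence between AUC and the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties counted as one half); the function names and the toy probabilities are assumptions for illustration.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probs, n_classes):
    """One AUC per class: that class is positive, all other classes negative."""
    aucs = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in probs]
        aucs.append(binary_auc(labels, scores))
    return aucs


# Toy three-class example (assumed values).
y_true = [0, 1, 2]
probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
per_class_auc = one_vs_rest_auc(y_true, probs, n_classes=3)
```

With 18 classes, as in D1, the same loop would yield 18 values, one per curve.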

Evaluated on D2 and D3
In this section, the D2 and D3 models are evaluated with appropriate metrics. The D2 models are trained on highly imbalanced classes of plant leaves. Each D2 model covers a single plant leaf kind with its healthy class and the remaining unhealthy classes.
To understand the model's performance, two kinds of feature extractors, i.e., VGG and ResNet models, are used as mentioned. When the models were fine-tuned as described in Section 4.4, the ResNet models were able to extract invariant features and learned better than the VGG models. To examine convergence, the D2 sets, i.e., the plant leaves of tomato, potato, and maize (corn), are fine-tuned on both ResNet and VGG models. It is observed that, even for very few iterations (epochs = 25), the ResNet models outperformed the VGG models in every scenario. The learning curves on D2 for both feature extractors are plotted in Fig. 5. The performance of the individual D2 models is given in Tab. 3.
Given the ability of ResNets, a set of 38 classes is fine-tuned end-to-end to compare the model's ability to generalize on unseen samples.
Most of the previous research extracted features using machine learning methods that do not capture invariances for generic kinds, i.e., these models are task-specific and require hand-engineered features. To overcome the problem of extracting features with hand-picked learning mechanisms, deep vision approaches were introduced. Most of the previous work developed a convolution neural architecture for extracting features with precise optimization. After the evolution of transfer learning [24], many researchers tend to apply pre-trained weights to a similar task.
The advantage is that they do not require precise hyperparameter tuning and reduce the computational budget to a great extent. Hence, these architectures provide pre-trained weights, and many studies apply transfer learning techniques to extract bottleneck feature vectors from the given input and classify them using either fully connected neural networks or machine learning classifiers such as SVM, decision trees etc. However, most of the research does not provide appropriate visual attention to the models, which is an important property of a visual recognition task. Providing visual attention to the models can extract fine-grained features containing detailed and precise information about each entity. Further, the training or fine-tuning mechanism should be appropriate for the model to converge fast and generalize well. These problems are addressed by providing a resilient model for capturing detailed invariances and by providing three levels of generalization sets with faster convergence. Further, a Restful-API [25] and a mobile application are created for capturing real-time images and classifying them with the proposed model in the back-end. The details of the Restful-API and the application are described in the next section.

Restful-API and Mobile Application
In this research, a deep learning model is created, deployed as a Restful API, and connected to a mobile application that makes predictions using an image as input. The REST architectural style is used because of its simple interface, the modifiability of its elements, its ability to adapt to changing needs (even while the application is running), and the portability of its elements, achieved by moving data along with the program code. It is also scalable to a large number of users. Fig. 6 shows the work-flow of the Restful API. Deploying a deep learning model as a REST API is a reliable and industrially practiced method for making predictions remotely. As shown in Fig. 6, the API accepts a base64 encoded image string as input and returns the prediction result in JSON format. Flask is used for creating the API and also acts as a web server and gateway handler for the input request.
In the next stage, the input base64 encoded image string is passed on for pre-processing: the string is decoded and converted back into an image, and the image is resized to 128 × 128 × 3 to be compatible with the model. After the pre-processing stage, the image is passed as input to the model to make the prediction. The predicted result is returned as output in JSON format. The whole process is wrapped in a single API function which accepts the base64 encoded image string and outputs the predicted result. We connected this API to a mobile application, which the user can download and use to predict by uploading an image or by capturing one through the camera. Fig. 1 shows the different stages and the flow of the mobile app. When an image is uploaded or captured, it is converted into a base64 encoded image string and passed to the API handler, which dispatches it to the proper function; the function then runs on the server, invoking the model with the given base64 encoded image string as input. This process is carried out by the Flask framework using gunicorn as the server in the backend. The REST API was deployed on the Heroku platform.
Figure 6: The complete mobile restful-API
Advances in deep learning improve performance as the size of the data increases. This eventually led to the design of an adaptable kernel [34], which would enhance the feature extraction process for both shallow and deep networks. It is observed that the smaller receptive field used by VGG, a wavelet kernel [35] and Gaussian kernels [36] helped in improving the quality of the features acquired. Advances in similarity metrics can yield scalable features [37][38][39][40][41][42]. Transforming a sequence of redundant pixels using a specific similarity measure [43][44][45][46] is challenging and is chosen as future scope for the present work.
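The request flow of such an API (base64 string in, JSON out) can be sketched with the standard library alone. This is a simplified illustration and not the deployed Flask code: `handle_request`, the JSON field names, and the stand-in `preprocess` and `model` callables are all hypothetical, and image decoding and resizing are represented by the `preprocess` argument instead of being implemented.

```python
import base64
import json


def handle_request(b64_image, preprocess, model, class_names):
    """Decode a base64 image string, run the model, return a JSON string."""
    raw_bytes = base64.b64decode(b64_image)   # undo the client's encoding
    model_input = preprocess(raw_bytes)       # e.g. decode and resize the image
    scores = model(model_input)               # one score per class
    best = max(range(len(scores)), key=lambda i: scores[i])
    return json.dumps({"label": class_names[best], "confidence": scores[best]})


# Stand-ins so the sketch runs without an image library or a trained model.
payload = base64.b64encode(b"fake image bytes").decode("ascii")
response = handle_request(
    payload,
    preprocess=lambda raw: raw,
    model=lambda x: [0.1, 0.9],
    class_names=["healthy", "diseased"],
)
```

Keeping the decoding, pre-processing and prediction steps behind one function mirrors the single-function design described in this section, so a web framework route only has to pass the request body in and return the resulting string.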

Conclusion
This research unifies and extends our conference work [32]. The study is motivated by the human visual cortex in designing an end-to-end trainable neural network, named bilinear convolution neural network, for plant leaf disease identification and classification. Bi-CNN models are developed for the data divisions D1, D2, and D3. These models outperform the existing literature by extracting fine-grained features that provide visual attention through a second-order pooling mechanism. The model attained its highest accuracy of 94.98% on D3, where 38 classes are considered. During this study, it is observed that ResNet, when used as a feature extractor, outperforms the VGG model and provides higher performance at lower computational expense. Finally, the model is embedded in a mobile API and released as open source. Despite its advantages, the methodology did not use 10-fold cross-validation because of its high computational cost. The proposed second-order pooling provides attention only when the bottleneck activations are of the same size; a new pooling technique is needed to overcome this limitation. The present experiment was carried out on individual leaf images. In future, we aim to use aerial imaging techniques to extract features from clustered plant leaf images with better precision and performance.