Crop Yield Prediction Using Machine Learning Approaches on a Wide Spectrum

Abstract: The exponential population growth in developing countries like India calls for innovative technologies in the agricultural process to meet future crises. One of the vital tasks is crop yield prediction at an early stage; it is one of the most challenging tasks in precision agriculture, as it demands a deep understanding of the growth pattern governed by highly nonlinear parameters. Environmental parameters such as rainfall, temperature, and humidity, and management practices such as fertilizers, pesticides, and irrigation, are highly dynamic and vary from field to field. In the proposed work, data were collected from paddy fields of 28 districts across a wide spectrum of Tamilnadu over a period of 18 years. The statistical model Multiple Linear Regression (MLR) was used as a benchmark for crop yield prediction and yielded an accuracy of 82% owing to its wide-ranging input data. Machine learning models were therefore developed on the same data set to obtain improved accuracy, namely the Back Propagation Neural Network (BPNN), Support Vector Machine (SVM), and General Regression Neural Network (GRNN). Results show that GRNN achieves the highest accuracy of 97% (R² = 0.97) with a normalized mean square error (NMSE) of 0.03. Hence, GRNN can be used for crop yield prediction in diversified geographical fields.

and contributes nearly 17%-18% of the GDP [1]. This sector significantly impacts the country's economy through its contribution to exports and the wide range of stakeholders involved. Moreover, food safety and security are paramount for a highly populated country like India. The United Nations has set Zero Hunger as one of its Sustainable Development Goals to achieve a better and more sustainable future [2]. All the effort expended in farming aims at a high yield at the expected time to satisfy all its stakeholders.
Predicting the crop yield at the early stages prepares farmers to make sound managerial and financial decisions and avoid last-moment surprises and losses. Yield prediction is a complex task because it depends on manifold, interconnected factors. Fundamentally, the yield of any crop depends on soil features, environmental factors, applied nutrients, and field management [3]. Here, the crop yield is the dependent variable, while the other components are independent and interdependent variables, making yield prediction complex. Among these interdependent variables, environmental factors are highly arbitrary and vital in deciding crop yield.
Conventionally, nutrients, pesticides, and irrigation are applied uniformly irrespective of environmental impacts and other arbitrary changes during the growing process, which leads to a poor yield [4]. To overcome this issue, we first need a better understanding of the relationship between the input parameters and the interdependencies important to the yield; a mathematical model has to be developed to relate the independent variables and their coefficients to the crop yield. Second, we need timely, accurate status updates from the field to understand the strength of each variable at the various growth stages. Third, sound decisions must be made to control irrigation and climate-related factors and to enhance soil nutrition, increasing crop quality while ultimately lowering the environmental impact and leading to a high yield [5].
Formerly, researchers estimated crop yield using statistical approaches, including the multivariate linear regression (MLR) technique; however, the prediction accuracy was not up to expectation. Currently, machine learning (ML) approaches are growing into a powerful descriptive and predictive tool for handling complex research problems. Crop yield prediction is one of the challenging problems in precision agriculture, and many models have been proposed and validated in the literature so far. Crop yield prediction at an early stage is a difficult task: the agricultural yield primarily depends on weather conditions (rain, temperature, etc.) and pesticides, and accurate information about crop yield history is essential for decisions related to agricultural risk management and future predictions. Many studies have used statistical models such as regression, multivariate regression, and artificial neural networks for crop yield prediction with limited input parameters. The table below summarizes the existing works on crop yield prediction using various methodologies and spectrums (Tab. 1).

- [6] 2016, Weighted histograms regression: Proposed a design strategy for selecting soybean varieties to exploit maximum yield in the best season based on knowledge attained from heterogeneous historical data. Comparison with the existing regression algorithm proved that the proposed algorithm offered an optimal selection of seed varieties.
- Regression analysis (RA): Focused on analyzing the environmental constraints, namely area under cultivation, annual rainfall, and food price index, that impact the crop yield. RA analyzes the factors and groups them into explanatory and response variables that aid in attaining a decision.
- [8] 2017, Gaussian process component with a spatio-temporal structure: Presented a scalable, accurate, and inexpensive technique to forecast crop yields using openly accessible remote sensing statistics. The proposed scheme improved the accuracy of yield prediction pointedly, along with a novel dimensionality reduction technique.
- [9] 2017, Generalized regression neural network and radial basis function neural network: The suggested method forecasted the yield of potato crops sown in flat and rough regions; of the two methods, the generalized regression neural network achieved greater accuracy.
- [10] 2017, Improved genetic algorithm-back propagation neural network prediction algorithm: The proposed algorithm was used to advance the yield-irrigation water model for forecasting the yield of various irrigation schemes under subsurface drip irrigation. It offered more precise predictions of the yield, with an average error of only 0.71%.
- [11] 2018, Remote Sensing (RS) and Machine Learning (ML) algorithms: Discussed the research evolution of the last fifteen years on ML-based methods for precise crop yield prediction compared with RS approaches. Determined that rapid expansion in sensing tools and ML techniques could bring cost-effective and wide-reaching solutions for enhanced crop yield and decision making.
- [12] 2019, Aggregated Rainfall-based Modular Artificial Neural Networks (ARMANN) and Support Vector Regression (SVR): Predicted the magnitude of monsoon rainfall using MANN and forecasted the yield of the chief Kharif crops from the rainfall data and zone using SVR.
- [13] 2019, Support Vector Regression (SVR), K-Nearest Neighbour and Random Forest (K-NNRF), and Artificial Neural Network (ANN): Considered an agricultural dataset covering 745 cases; 70% of the records were randomly designated to train the approaches and the remaining 30% to test the system model and assess its prediction capacity. Among the comparative approaches, random forest (RF) presented the best accuracy in yield prediction.
- Achieved superior prediction precision through the recommended approaches, with an RMSE of 12% of the average yield and 50% of the standard deviation (SD) for the validation dataset considering predicted weather data.
- [15] 2019, Artificial Neural Network (ANN): Among the models considered, the multilayer perceptron offered the best prediction.
- [16] 2020, Hybrid Genetic Algorithm-based Back-Propagation Neural Network (GA-BPNN): The suggested model was adapted to provide complementary data on crop growth (maize) at the vibrant growth phase. The hybrid approach improves crop yield prediction pointedly compared with the original back-propagation (BP) approaches.
- [17] 2021, Support Vector Machine (SVM), Random Forest (RF), and Neural Network (NN): The enhanced vegetation index from MODIS and solar-induced chlorophyll fluorescence from GOME-2 and SCIAMACHY were used as metrics for crop yield prediction. The ML schemes presented better crop yield prediction than the statistical method.
Further, Gu et al. [18] proposed a hybrid model using a back-propagation algorithm combined with a genetic algorithm to forecast corn yield for diverse irrigation systems and found the average error to be only 0.71%. Also, Kodimalar et al. [19] investigated a pool of machine learning techniques in the big-data computing model and recommended SVM and ANN as the most appropriate ML models for rice yield prediction. Furthermore, Maya Gopal et al. [7] found that the Forward Feature Selection algorithm integrated with the random forest algorithm efficiently selects the appropriate input parameters for accurate crop yield prediction. Moreover, Mohsen et al. [20] designed a few more ensemble models considering complete and partial in-season weather knowledge with a blocked sequential procedure and achieved 9.5% RRMSE with the optimized weighted ensemble and the average ensemble models. Cai et al. [21] compared regression-based methods with machine learning methods for wheat yield prediction in Australia and concluded that machine learning methods have higher performance, with R² of 0.75 two months before wheat maturity. Eventually, Ansarifar et al. [22] attempted to select the most closely fitting environmental and management parameters and to quantify the extent of interaction among them with respect to the crop yield using an interaction regression model, achieving an RRMSE of less than 8%.
The rest of this paper is organized as follows. In Section 2, the dataset and site descriptions are provided along with each input parameter and the target value. In Section 3, the theory behind the statistical model and the machine learning models are explained. In Section 4, the performance of each model is discussed in detail, and Section 5 concludes the paper.

Statistical Analysis
To estimate the yield, a multiple linear regression (MLR) was applied. MLR is a well-known method used to derive the relationship between a dependent variable and one or more independent variables. The MLR model is described by the following equation [27]:

y = b_0 + Σ_{i=1}^{P} b_i x_i + e

where y is the predicted variable, x_i (i = 1, 2, ..., P) are the predictors, b_0 is the intercept (coordinate at origin), b_i (i = 1, 2, ..., P) is the coefficient of the i-th predictor, and e is the error associated with the prediction.
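As a brief illustration (not the authors' exact pipeline; the synthetic data below is an assumed stand-in for the paddy-field records), an MLR of this form can be fit with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the eight field parameters (rainfall, temperature, ...)
rng = np.random.default_rng(0)
X = rng.random((100, 8))                                  # predictors x_1 ... x_P
true_b = rng.random(8)
y = 1.5 + X @ true_b + 0.01 * rng.standard_normal(100)    # yield with error term e

mlr = LinearRegression().fit(X, y)
b0, b = mlr.intercept_, mlr.coef_      # intercept b_0 and coefficients b_i
y_hat = mlr.predict(X)                 # y = b_0 + sum_i b_i * x_i
r2 = mlr.score(X, y)                   # proportion of variance explained (R^2)
```

The fitted `coef_` vector corresponds directly to the b_i coefficients of the equation above.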

Machine Learning Techniques

Back Propagation Neural Network (BPNN)
A neural network is a circuit of neurons, and the backpropagation neural network is a supervised learning algorithm for training a multilayer perceptron. In this model, the input layer has eight neurons for the eight input parameters. Random weights are initiated, and a bias value is added. At the hidden layer, three neurons pass the weighted inputs through a logistic (sigmoid) activation function, and the result then reaches the single-neuron output layer. BPNN minimizes the error function in weight space using the delta rule (gradient descent); the weights that minimize the error function to a global optimum are considered a solution to the learning problem [28].
The architecture of the BPNN model and the input parameters are given in Fig. 2 and Tab. 3, respectively. Each neuron computes the sum of all its weighted inputs and passes that sum to an activation function f:

H_n = f( Σ_m wI_{m,n} I_m ),  O_l = f( Σ_n wH_{n,l} H_n )

where H_n denotes a hidden-layer neuron (the subscript n indexes the neuron), O_l is a neuron output, I_m is an input, and wI_{m,n} and wH_{n,l} are the synaptic weights. The hyperbolic tangent sigmoid function is

f(x) = (e^x − e^{−x}) / (e^x + e^{−x})

and the linear transfer function f(x) = x is applied at the output layer.
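A minimal sketch of this 8-3-1 architecture, using scikit-learn's `MLPRegressor` as a stand-in for the paper's BPNN (the learning rate, iteration budget, and synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 8))        # eight normalized input parameters
y = np.tanh(X.sum(axis=1))              # synthetic target in lieu of field yields

# 8 inputs -> 3 logistic hidden neurons -> 1 linear output neuron,
# weights updated by gradient descent on the squared-error function.
bpnn = MLPRegressor(hidden_layer_sizes=(3,), activation='logistic',
                    solver='sgd', learning_rate_init=0.01,
                    max_iter=5000, random_state=1)
bpnn.fit(X, y)
y_hat = bpnn.predict(X)
```

The two weight matrices correspond to the synaptic weights wI (input-to-hidden) and wH (hidden-to-output) in the equations above.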
The following normalization is applied to force the data into the defined range:

Y_N = (y_max − y_min)(x − x_min) / (x_max − x_min) + y_min

where Y_N represents the normalized value; x_min and x_max are the minimum and maximum of the data; and y_min and y_max are −1 and 1, respectively.
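The normalization above can be sketched directly (the `normalize` helper and the rainfall values are hypothetical, for illustration only):

```python
import numpy as np

def normalize(x, y_min=-1.0, y_max=1.0):
    # Y_N = (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min
    x_min, x_max = x.min(), x.max()
    return (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min

rainfall = np.array([850.0, 1200.0, 640.0, 990.0])  # hypothetical mm values
scaled = normalize(rainfall)                         # now within [-1, 1]
```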

Support Vector Machine (SVM)
The Support Vector Machine aims to identify a hyperplane in an N-dimensional space that separates the data points. In Support Vector Regression (SVR), the margins are chosen to cover the maximum number of data points, leaving a few outside that are treated as slack variables. SVR is a very efficient algorithm because the fit is determined only by the support vectors that lie on the margin boundaries. Moreover, SVR can incorporate nonlinearity very efficiently through the kernel trick; in our model, the radial basis function is used as the kernel. The input parameters used for the model are given in Tab. 4. The data samples {x_i, y_i}, (i = 1, 2, ..., n), x_i ∈ R^n, y_i ∈ R, are fitted with the function f(x) = w · x + b. According to SVM theory, the fitting problem can be posed as the dual optimization [28]:

Max over a_i, a*_i:  −(1/2) Σ_i Σ_j (a_i − a*_i)(a_j − a*_j) K(x_i, x_j) − ε Σ_i (a_i + a*_i) + Σ_i y_i (a_i − a*_i)

subject to Σ_i (a_i − a*_i) = 0 and 0 ≤ a_i, a*_i ≤ C, where C is a constant representing a penalty factor that indicates the degree of penalty for excess error, and K(x_i, x_j) is a kernel function. The values of a_i, a*_i, and b are obtained by solving this quadratic optimization problem; in general, only a small portion of the a_i, a*_i are non-zero, and the corresponding samples are named support vectors. The common kernel functions at present are:

1. Linear kernel: K(x_i, x_j) = x_i · x_j
2. Polynomial kernel: K(x_i, x_j) = (x_i · x_j + 1)^d
3. Radial basis kernel function: K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))
4. Two-layer neural kernel: K(x_i, x_j) = tanh(κ (x_i · x_j) + θ)
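An ε-SVR with the radial basis kernel used in the paper can be sketched as follows (the C and ε values and the synthetic data are illustrative assumptions, not the paper's tuned settings):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (150, 8))                         # eight normalized inputs
y = np.sin(2 * X[:, 0]) + 0.05 * rng.standard_normal(150)

# Radial basis kernel; C is the penalty factor for excess error,
# epsilon sets the width of the insensitive tube around the fit.
svr = SVR(kernel='rbf', C=10.0, epsilon=0.05).fit(X, y)
n_sv = len(svr.support_)          # only these support vectors define f(x)
y_hat = svr.predict(X)
```

Note that the prediction depends only on the support vectors, which is why SVR's maximum error behaves differently from GRNN's in the results section.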

General Regression Neural Network (GRNN)
The general regression neural network is an improved variant of the RBF neural network that is well suited to regression problems, particularly for dynamic systems like yield prediction. The architecture of the model is illustrated in Fig. 3. In this model, every training sample represents the mean of a radial basis neuron. The network has four layers: the input layer, hidden layer, summation layer, and decision layer. GRNN is expressed mathematically as follows. The probability estimator f̂(X, Y) is derived from the sample values X_i and Y_i of the random variables x and y:

f̂(X, Y) = [1 / (n (2π)^{(p+1)/2} σ^{p+1})] Σ_{i=1}^{n} exp(−D_i² / (2σ²)) exp(−(Y − Y_i)² / (2σ²))

where n represents the number of sample observations, p denotes the dimension of the vector variable x, and σ is the width of each sample. The scalar function D_i² is the squared Euclidean distance

D_i² = (X − X_i)ᵀ (X − X_i)

The output layer consists of one neuron, which yields the predicted output Ŷ(x) for an unknown input vector x:

Ŷ(x) = Σ_{i=1}^{n} Y_i exp(−D_i² / (2σ²)) / Σ_{i=1}^{n} exp(−D_i² / (2σ²))

where exp(−D_i² / (2σ²)) is the activation function.
The activation function serves as the weight of each training sample. The spread parameter σ is the one unknown constant, and it is adjusted during training to the optimum range where the error is minimized; it is varied between 0.0001 and 1. The best practice is therefore to minimize the MSE. All normalized data sets are divided into training and testing sets as per the thumb rule: the network is trained on 70% of the data, and the remaining 30% is used to test and evaluate the network, as for the previous models.
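GRNN has no dedicated estimator in scikit-learn, so a minimal NumPy sketch of the four-layer computation above (Gaussian pattern activations, summation layer, and the dividing decision layer) may clarify it; the spread σ and the synthetic data are illustrative assumptions:

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.2):
    """GRNN output: Gaussian-weighted average of training targets, where
    D_i^2 is the squared Euclidean distance to each training sample."""
    preds = []
    for x in X_query:
        d2 = np.sum((X_train - x) ** 2, axis=1)       # pattern layer: D_i^2
        w = np.exp(-d2 / (2.0 * sigma ** 2))          # activation exp(-D^2/2sigma^2)
        preds.append(np.dot(w, y_train) / w.sum())    # summation / decision layers
    return np.array(preds)

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (100, 8))
y = X[:, 0] ** 2                                      # synthetic stand-in target
y_hat = grnn_predict(X, y, X, sigma=0.2)
```

As σ shrinks, the estimate at any training point collapses to that point's own target, which is why tuning σ between 0.0001 and 1 trades off smoothing against memorization.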

Results and Discussions

Multi Linear Regression (MLR)
The MLR model was developed based on the independent input variables, namely rice area, rice production, rainfall, ET, precipitation, temperature, and fertilizers, and the dependent output variable, the crop yield. The estimated output was computed from the fitted MLR regression equation. The paddy yield prediction of the MLR model is plotted as actual vs. predicted values in kg/Ha (Fig. 4), where a noticeable discrepancy between the actual and predicted yields can be seen. Further, the regression statistics illustrated in Tab. 5 fall in acceptable ranges: the multiple R, R², adjusted R², and standard error are 0.910624, 0.8292236, 0.825516, and 388.8849, respectively.

Figure 4: MLR model
Considering the limited significance of the results observed from the MLR model, it is essential to turn to machine learning models to precisely predict crop yield. Therefore, the following sections apply various machine learning approaches to crop yield prediction.

Machine Learning Models
Further, for better visualization, the different machine learning models, namely the back-propagation neural network (BPNN), Support Vector Machine (SVM), and General Regression Neural Network (GRNN), are demonstrated on a virtual platform that generates a graph between actual and predicted yield. The simulated plot for each model is given in Fig. 5.

From the observed plots, it is perceived that the best-fit lines of all three models show good agreement between actual and predicted yield. Among the three models, the prediction curve fits the actual yield most precisely in the GRNN model, as can be verified from the distribution of dots in the plotted images.

Also, to make the potential yield more practical, concise, and readable, time-series analysis was performed for all the considered machine learning approaches. This form of representation clearly distinguishes the predicted yield from the actual yield and shows the validation samples separately from the training samples. The simulated results of each model are illustrated in Fig. 6. As shown in these figures, the time-series results reveal the prediction accuracy between actual and predicted values. All the models show good accuracy; however, the GRNN model offers the most precise prediction among the approaches. This is further confirmed using the evaluation metrics described in the following section.

Evaluation Metrics for Machine Learning Models
The effectiveness of the machine learning models was gauged by using the following seven evaluation metrics. The values obtained by each model in these metrics are shown in Tab. 6.
The proportion of variance explained by the model (R²): In a regression problem, R² denotes the proportion of variation in the dependent variable that is explained by the independent variables.

The R² value of the MLR method, 0.82, is taken as a benchmark; the ML models achieved R² of 0.89, 0.93, and 0.97 for the BPNN, SVM, and GRNN models, respectively. GRNN has the potential to explain 97% of the variance from the input parameters towards the yield, thereby offering the highest prediction accuracy.
Coefficient of variation (CV): It is a valuable tool to compare the results of two models and determine which has more variance relative to its mean.

Coefficient of variation = (Standard Deviation / Mean) × 100 (19)

In this work, the CVs are observed as 0.08, 0.07, and 0.05 for the BPNN, SVM, and GRNN models, respectively. BPNN shows the most variance among these, and GRNN the least.
Normalized mean square error (NMSE): This metric is considered a practical test for model performance, overviewing the entire data set of samples unbiased towards over or under prediction.
The NMSE values of BPNN, SVM, and GRNN are found to be 0.11, 0.07, and 0.03, respectively. It is noticed that the error rate is very minimum for the GRNN model.
Maximum Error of Estimation: It indicates the accuracy of the prediction and is defined as 50% of the width of a confidence interval; it is also called the margin of error. SVM has the least error estimate of 560.65, as it takes only the margin values (support vectors) into consideration, whereas GRNN has a maximum error of 1031.02 because the Euclidean distance of every sample is considered for each estimate.

Root Mean Squared Error: It measures how far the data points are spread around the best-fit line; statistically, it is the standard deviation of the residuals.
The RMSE value for BPNN, SVM, and GRNN is evaluated to be 296.07, 234.65, and 161.47, respectively. This metric shows that the predictions of the GRNN model are very close to the best fit line with an RMSE of 161.47 taken from 470 fields spread over the state of Tamilnadu.
Mean Absolute Error: The absolute error measures the magnitude of the difference between the actual yield and the predicted yield; MAE is the mean of the absolute error.

The lower the MAPE, the closer the model is to the best-fit line. Among the models, GRNN has a very low MAPE of 3.11, indicating a better fit than the other models. From the results of the machine learning models across the seven metrics, the following observations were noted: BPNN takes comparatively less time for analysis, but the deviation of its predictions from the actual yield was larger, and hence it is less efficient. SVM is more accurate than BPNN but takes more time to train and validate. GRNN delivers the highest performance in predicting crop yield in a diverse environment, with an R² of 0.97. Further, a run-time analysis was carried out for all models, i.e., the time taken for each model to arrive at the best-fit line: BPNN takes the least time of 24 μs, whereas SVM and GRNN take 60 and 4 ms, respectively.
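The metrics above can be computed from actual and predicted yields as sketched below. One assumption is flagged in the code: NMSE is taken here as the residual sum of squares over the total sum of squares (i.e., 1 − R²), which matches the reported R²/NMSE pairs, though the paper does not spell out its normalization; the yield values are hypothetical.

```python
import numpy as np

def evaluate(y_true, y_pred):
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        'R2':     1.0 - ss_res / ss_tot,          # proportion of variance explained
        'NMSE':   ss_res / ss_tot,                # assumed normalization: 1 - R^2
        'RMSE':   float(np.sqrt(np.mean(resid ** 2))),
        'MAE':    float(np.mean(np.abs(resid))),
        'MAPE':   float(100.0 * np.mean(np.abs(resid / y_true))),
        'MaxErr': float(np.max(np.abs(resid))),
        'CV':     float(np.std(y_pred) / np.mean(y_pred)),  # SD/mean, as a fraction
    }

actual    = np.array([3100.0, 2850.0, 3320.0, 2990.0])   # hypothetical kg/Ha
predicted = np.array([3050.0, 2900.0, 3290.0, 3010.0])
scores = evaluate(actual, predicted)
```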

Conclusions
Crop yield prediction plays a significant role in the agricultural sector and can be performed using statistical and machine learning algorithms. In this work, a statistical model, namely MLR, and machine learning models, namely BPNN, SVM, and GRNN, are demonstrated for a wide-area spectrum covering the Indian state of Tamilnadu. Seven different evaluation metrics are derived to warrant the reliability of the observed results. Based on the attained results, the following conclusions are made: Compared with the statistical model (MLR), the ML models offered better agreement between actual and predicted values, which was verified using time-series analysis. The GRNN model had the greatest potential, explaining 97% of the variance from the input parameters towards the crop yield and thus offering the highest prediction accuracy.
BPNN showed the most variance (CV), i.e., 0.08, while GRNN had the smallest variance scale of about 0.05. NMSE and RMSE were found to be lowest for the GRNN model, i.e., 0.03 and 161.47, respectively: the smallest scale among the ML approaches. MAE and MAPE were also observed to be best for the GRNN model compared with the other models, i.e., 82.74 and 3.11, respectively. The only limitation of the GRNN model was the run time: BPNN took just 24 μs, whereas GRNN took about 4 ms.
Consolidating all the inferences, it can be concluded that the GRNN model is more suitable for crop yield prediction for a broad spectrum owing to its superior prediction accuracy.
Funding Statement: This study was supported by Suranaree University of Technology, Thailand.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.