Semantic Information Extraction from Multi-Corpora Using Deep Learning

Abstract: Information extraction plays a vital role in natural language processing by extracting named entities and events from unstructured data. Owing to exponential data growth in the agricultural sector, extracting significant information has become a challenging task. Although existing deep learning-based techniques have been applied in smart agriculture for crop cultivation, crop disease detection, weed removal, and yield production, it remains difficult to find the semantics between extracted pieces of information because of the direct effects of weather, soil, pest, and fertilizer data. This paper consists of two parts. The first phase proposes a data preprocessing technique for removing ambiguity in the input corpora. The second phase proposes a novel deep learning-based long short-term memory network with a rectified Adam optimizer and a multilayer perceptron to find agricultural named entities, events, and the relations between them. The proposed algorithm has been trained and tested on four input corpora, i.e., agriculture, weather, soil, and pest & fertilizers. The experimental results have been compared with existing techniques, and the proposed algorithm outperforms Weighted-SOM, RAO, PLR-DBN, KNN, and Naïve Bayes on standard parameters such as accuracy, sensitivity, and specificity.


Introduction
The agricultural sector contributes a major share to the Indian economy and is highly sensitive to climatic changes. Factors such as small landholdings and excessive dependence on fertilizers and monsoons add further vulnerabilities to the Indian agricultural sector [1][2][3]. A large amount of unstructured agricultural data is underutilized due to the lack of data processing schemes. In developing countries like India, human experts and government policies are still the primary factors in decision-making. Factual validation based on current data is still missing from policymaking [4].
In the last few decades, climate variability has affected agricultural sectors across broad regions, including agricultural water resources, crop growth and development, and crop production [5][6][7][8]. In the Indian subcontinent, researchers study the climate-crop relationship based on long-term fertility, regional statistics, and other field experiments that show the yields of wheat and rice through simulation-based crop production models [9]. Most land in Uttarakhand state is fertile, but because of land subdivision problems, farmers consider agriculture an infeasible source of food security. The major crops of Uttarakhand are maize and rice, known as Kharif (monsoon) crops. Kharif crop production is much lower in the Uttarakhand region than in other regions because of environmental conditions such as the constant threat of landslides during rains and high rates of erosion. Crop production is completely dependent on rain-fed agricultural land: in Uttarakhand, almost 80% of agricultural production is rain-fed [10]. The individual growth in diverse agroecosystems with different hydro-geological regions, together with the diversity in crops and cropping techniques, defines a highly resilient system. Traditional crop rotations and practices also help in maintaining this diversity, which may vary with irrigation conditions, altitude, soil type, moisture regime, local knowledge, and the direction and degree of slope [11].
Weather is not the only essential component for a suitable crop; soil and fertilizers contribute equally. However, current machine learning methods such as Bayesian networks, Gaussian kernel-based support vector machines (SVM), and artificial neural networks (ANN) are unable to identify the suitable soil, or the suitable pest treatment and fertilizer for a selected soil [12,13]. Soil quality depends on the electrical conductivity, pH level, macronutrients, and micronutrients required by the selected crop [14]. These soil quality indices help farmers select the appropriate pest treatment and fertilizer for a better yield of the selected crop.
As per Fig. 1, inputs can be domain-dependent or independent unstructured/semi-structured corpus (or corpora), domain-specific knowledge, and user-specified extraction patterns [15][16][17]. The information extraction (IE) engine processes the input data to extract knowledge and save it into a structured database (relational and graph databases). The researchers have proposed very limited empirical and soft computing techniques for the prediction of rainfall for crop productivity along with the appropriate land details of the Uttarakhand region. The proposed work bridges this gap by extracting the semantics between extracted named entity recognition (NER) and events from unstructured agricultural text with a focus on the Uttarakhand region [18]. The major contributions of the present research work are: (1) A novel deep learning technique for semantic information extraction using four input corpora (agriculture, weather, soil, and pest & fertilizer) was proposed. The proposed deep learning technique uses long short-term memory (LSTM) with two classifiers i.e., rectification of Adam optimizer and multilayer perceptron (MLP). (2) To remove the noise from input corpora, a new word sense disambiguation (WSD) algorithm was introduced. (3) The proposed technique is able to predict the increase in crop intensity, crop yields, and the resulting increase in the employment of the Uttarakhand region.
The remainder of this paper is organized as follows. Section 2 surveys recent technologies applied to IE for improving crop productivity. Section 3 presents the proposed methodology, including the WSD algorithm. The experimental results, discussion, and validation of the proposed method are reported in Section 4. Section 5 concludes the paper and discusses future work directions.

Literature Survey
For the last two decades, machine and deep learning techniques have made a large contribution in handling the information extraction problem from various application areas including medical image analysis and retrieval [19][20][21][22][23], biometrics recognition [24][25][26], disease diagnosis [27,28], agriculture, etc. The following literature study shows the related work on the agricultural sector using machine learning techniques.
Nair et al. [29] applied ANN to Global Climate Model (GCM) output for India. The goal was to predict Indian Summer Monsoon Rainfall values using precipitation outputs from GCMs. The ANN technique was applied to different ensemble members of the individual GCMs to obtain monthly-scale predictions for India and its sub-divisional regions. Simple randomization and a double cross-validation procedure were used to minimize over-fitting while training the ANN. The ANN-predicted rainfall from the GCMs was assessed by examining the absolute error, box plots, percentiles, and the difference in linear error in probability space. The results indicated significant improvements in the forecast skill of these GCMs after applying the ANN technique. However, the datasets depend on past values of the primary variable and not on explanatory factors that may influence the system or variable. Satir et al. [30] proposed a Stepwise Linear Regression and vegetation indices method for crop yield estimation. Using object-based classification and a multi-temporal Landsat dataset, the crop patterns of an area were mapped. Prediction quality was estimated using the Mean Percent Error (MPE), which was computed for cotton, corn, and wheat and combined with soil salinity degrees. Forecasting was based on weather data, and prediction accuracy was reduced because it relied on a single parameter.
Das et al. [31] investigated hybrid algorithms built from the Least Absolute Shrinkage and Selection Operator (LASSO), ANN, penalized regression models including the elastic net (ENET), Principal Components Analysis (PCA), and Stepwise Multiple-Linear Regression (SMLR) for predicting rice yield from long-term weather data. The experimental results showed that LASSO and ENET performed well because these methods reduce model complexity and prevent overfitting by penalizing the magnitude of the coefficients. A pairwise multiple comparison test found that the hybrid models worked well for crop prediction on the west coast of India. However, combinations of feature selection and feature extraction methods with neural networks, including PCA-SMLR, performed poorly because PCA does not take the dependent variable into account when transforming the input variables.
He et al. [32] implemented a Hybrid Wavelet-based Neural Network (HWNN) which included Particle Swarm Optimization (PSO), Mutual Information, and Multi-Resolution Analysis into ANN for predicting rainfall from antecedent climate indices and monthly rainfall. The Maximal Overlap Discrete Wavelet Transform decomposed the large-scale climate indices and standardized monthly rainfall anomaly into subseries components with various time scales. The PSO algorithm was applied to find the optimal neuron numbers in ANN's layers (hidden) and the predictor (selected) predicted anomaly sub-series for each rainfall. HWNN method was more efficient for particular season rainfall prediction but took high prediction time in different season rainfall prediction.
Mohan et al. [33] implemented parallel layer regression with Deep Belief Network (PLR-DBN) for the estimation of food crop productivity using factors such as season types, soil type, risk factor, and water availability. The proposed PLR-DBN method targeted five crops in Karnataka based on accuracy, sensitivity, and specificity. Talukder et al. [34] designed a prediction and recommendation technique that determines food crop productivity based on temperature, rainfall, and humidity parameters. K-nearest neighbor (KNN), random forest, SVM, logistic regression, Naïve Bayes classifier were used for the prediction model. Collaborative and multi-condition filtering techniques are used for the recommendation system.
To improve overall crop productivity, this paper develops a deep learning-based method for Uttarakhand data. The weather data come from the India Meteorological Department (IMD), Dehradun, whereas the soil and the pest and fertilizer corpora are open-source databases.

Study Area and Dataset Description
Uttarakhand is a state in the northern part of India that spreads from 79°15' east longitude to 30°15' north latitude, with a geographical area of 53,483 square km. This state was taken as the study area for our research. Uttarakhand comprises the Garhwal region (Chamoli, Dehradun, Pauri, Uttarkashi, Rudraprayag, Tehri, and Haridwar districts) and the Kumaon region (Almora, Bageshwar, Nainital, Pithoragarh, Champawat, and Udham Singh Nagar districts). For modeling rainfall-runoff events, the entire region has been considered so that almost the whole state area is covered. Data from various sources, such as IMD and the soil and pest & fertilizer corpora, were gathered from research organizations such as the District Soil Testing Laboratory, Dehradun, and the Soil Testing Laboratories located at Nanda ki Chowki, Premises of Directorate of Agriculture, Premnagar, Dehradun, and a database was created.

Proposed Methodology
The next five subsections include the proposed framework, min-max algorithm applied for data preprocessing, corpora concatenation techniques, proposed WSD algorithm, and deep learning-based IE algorithm.

Proposed Framework for Semantic IE
The research framework presents a theoretical and practical approach for extracting semantic information. Unlike the few existing frameworks in the literature, this approach attempts to give a structure that highlights the fundamental concepts and components of semantic IE. The methodology followed in this study consists of collecting articles in the selected areas, collecting authentic data in those fields (mostly benchmark datasets from repositories), and selecting appropriate data mining tools, data storage tools (Excel, Oracle), and editing tools.
This section elaborates the operational framework to present the complete flow of the research components of this study. The study mainly revolves around information gathering, data pre-processing, semantic extraction, and data post-processing. These four core parts are involved in the practical implementation of the framework. Fig. 2 shows the overall view of the present research, in which the framework is divided into four modules, including corpus concatenation, deep network-based NER and event extraction (EE), and semantic extraction.

Min-Max Algorithm
This subsection describes the preprocessing of the input corpora, i.e., removing noise and identifying missing values. The input data are taken from the database and are recorded in different units, such as temperature in degrees Celsius and wind speed in miles per hour. To avoid scaling effects in the deep learning architecture, the proposed method uses variables normalized to the interval [0, 1]. The normalization applied to the dataset is given in Eq. (1), where a_i denotes the normalized value of the i-th variable, min a_i denotes the minimum value of that variable registered in the training dataset, and max a_i denotes its maximum value in the training dataset.
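The normalization described above can be sketched as follows. This is a minimal illustration assuming the standard min-max form (value minus training minimum, divided by the training range); the function names and the sample readings are hypothetical and are not taken from the paper's dataset.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and maximum of the training data."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def min_max_scale(x, col_min, col_max):
    """Scale each column to [0, 1] using training-set statistics."""
    x = np.asarray(x, dtype=float)
    span = col_max - col_min
    # Guard against constant columns (span == 0) to avoid division by zero.
    safe_span = np.where(span == 0, 1.0, span)
    return (x - col_min) / safe_span

# Hypothetical readings: temperature (Celsius) and wind speed (miles per hour).
train = np.array([[10.0, 5.0], [20.0, 15.0], [30.0, 25.0]])
col_min, col_max = fit_min_max(train)
scaled = min_max_scale(train, col_min, col_max)
```

The minimum and maximum are taken from the training data only and then reused for any later data, which matches the description of min a_i and max a_i as training-set values.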

Corpora Concatenation
The previous subsection used a min-max algorithm to reduce noise in the input corpora; these unstructured input corpora are then converted into a single unified corpus. Because the corpora differ in nature, merging them would otherwise require extensive human intervention. Knowledge-based WSD (KB-WSD) and corpus-based WSD (CB-WSD) are two basic methods used for combining two or more corpora into a single entity. The main purpose of integrating these two popular algorithms is to remove semantic ambiguity and merge corpora of different natures into a single entity called Agri_Corpus. Integrating CB-WSD into KB-WSD has shown a low improvement rate in many cases. Therefore, the proposed model integrates KB-WSD into CB-WSD, as discussed in the next section.
The agricultural data were collected from Krishi Vigyan Kendra, Dhakrani, District Dehradun, Uttarakhand (http://agricoop.nic.in/sites/default/files/UKD7-Dehradun-10.07.14.pdf). Tab. 1 shows sample data for rainfall prediction to improve the crop productivity of the Uttarakhand region, covering major crops such as rice, barley, and potato. Rice has a productivity of 19689.9 kg per hectare (ha) during the rainfall season, whereas barley gives nearly 20 kg per ha in the winter season. Rice is considered the most important crop for Uttarakhand because of its productivity. Potato can be cultivated during the summer season, with a productivity of 22140 kg per ha. In the Uttarakhand region, rice, barley, and potato are mainly sown at low, medium, and high rainfall, respectively. Tab. 2 shows sample monthly average rainfall data for one year. The data were collected from the region over nearly 20 years, and the rainfall values can be calculated using the predicted values from the sample table. In the predicted column, 0 indicates low rainfall, 1 indicates medium rainfall, and 2 indicates high rainfall.

Proposed Disambiguation Algorithm
Before applying the natural language processing technique, input data should be processed using a disambiguation algorithm. Few existing methods can be used to extract the sense of ambiguous words from an unstructured text [35]. The proposed algorithm was used to extract the sense of ambiguous words present in the corpus collected for current research. Cosine similarity has been used to measure the similarity between two words and cosine distance to find the similarity distance between two words [36].
Eqs. (2) and (3) define the cosine similarity and the cosine distance between two words W_i and S_i, written Sim(W_i, S_i) and D_amb(W_i, S_i), respectively. The cosine distance ranges between 0 and 1, where 1 indicates that W_i and S_i are different in nature and 0 (≈0) indicates that W_i is associated with S_i [37].
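A minimal sketch of this similarity-based sense selection is given below, assuming the usual definitions (similarity as the normalized dot product, distance as one minus similarity). The vector representation, the function names, and the numeric vectors are illustrative assumptions; the paper does not specify how words and senses are vectorized. With non-negative vectors, the distance stays within [0, 1] as stated above.

```python
import numpy as np

def cosine_similarity(w, s):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    w = np.asarray(w, dtype=float)
    s = np.asarray(s, dtype=float)
    denom = np.linalg.norm(w) * np.linalg.norm(s)
    if denom == 0:
        return 0.0
    return float(np.dot(w, s) / denom)

def cosine_distance(w, s):
    """Distance as one minus similarity: 0 means associated, 1 means unrelated."""
    return 1.0 - cosine_similarity(w, s)

# Made-up context vector for the ambiguous word "session" and its candidate senses.
word_vec = [2.0, 0.0, 1.0]
sense_vecs = {
    "period of activity": [1.0, 1.0, 0.0],
    "serious meeting": [0.0, 2.0, 0.0],
    "weather session": [2.0, 0.0, 2.0],
}

# Select the sense with the smallest cosine distance to the word in context.
best_sense = min(sense_vecs, key=lambda name: cosine_distance(word_vec, sense_vecs[name]))
```

Choosing the minimum distance is one simple decision rule; a threshold on the distance could be used instead when no sense should be assigned to weakly related words.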

Proposed Algorithm for Semantic IE Using LSTM-RAO and MLP
The following algorithm applies the min-max algorithm for data normalization and builds on the advantages of LSTM techniques. In each backpropagation iteration, RAO and the MLP are applied to update the weights of the deep network. RAO inherits the properties of the RMSProp and AdaGrad optimizers.

[Algorithm listing only partially recovered. Surviving steps: compute the length of the approximated simple moving average (SMA); (f) if the variance is tractable, i.e., ρ_t > 4, then compute the bias-corrected moving second moment.]
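The surviving steps correspond to the variance rectification test used in the rectified Adam optimizer. The sketch below follows the commonly published rectified Adam formulation rather than the paper's own listing, which is not fully recoverable; the function name and the default decay rate beta2 = 0.999 are assumptions.

```python
import math

def rectification(step, beta2=0.999):
    """Return (rho_t, r_t) for a 1-based step; r_t is None if variance is not tractable."""
    # Maximum length of the approximated simple moving average (SMA).
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated SMA at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t > 4.0:
        # Variance is tractable: compute the rectification term.
        r_t = math.sqrt(
            ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
        return rho_t, r_t
    # Variance is not tractable: the caller falls back to an unadapted momentum update.
    return rho_t, None

early = rectification(1)
late = rectification(10000)
```

In the first few steps the test fails and the update uses momentum only; once ρ_t exceeds 4 the adaptive step is scaled by r_t, which approaches 1 as training proceeds.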

LSTM with RAO and MLP
This section presents the proposed deep learning technique. An LSTM is used as the feature selection method and is responsible for handling time series, whereas an MLP network and the rectified Adam optimizer (RAO) are used for the classification and prediction tasks. Fig. 3 shows the LSTM with RAO and MLP based deep learning network [38]. The proposed model can be divided into two parts, namely feature selection and classification. In this network, the hyperbolic tangent transfer (tansig) activation function is used in the deep hidden layers, and the sigmoid (sig) activation function is used to increase the correlation within the target data. The activation functions used in the hidden layers are stated in Eq. (4).
As per Eq. (5), the forget gate outputs a value f_T in [0, 1] for each number in the cell state C_{T−1}, where W_fg and b_fg represent the weight and bias of the forget gate. From the input X_T, a sigmoid layer and a tanh layer are used to decide what to store and update in the cell state. In Eq. (6), the sigmoid output in (0, 1) decides whether information is ignored or updated, and in Eq. (7) the tanh output in (−1, 1) sets the importance level of the candidate values. N_T and i_T are multiplied to obtain the new memory content, which is added to the last memory value C_{T−1}, scaled by the forget gate, to find the updated C_T as shown in Eq. (8). In the next step, the output value h_T is derived from the output gate O_T and the cell state. In Eq. (9), a sigmoid function selects which parts of the cell state take part in the output; in Eq. (10), the output gate value O_T is multiplied by the tanh of the new cell state C_T, which lies in (−1, 1), to give h_T.
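The gate computations described above can be sketched as a single LSTM step. This is an illustrative sketch assuming the standard LSTM formulation; the parameter names (W_f, W_i, W_n, W_o and the matching biases), the layer sizes, and the random weights are hypothetical, and only the symbol roles (f, i, N, O, C, h) follow the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x_t])                  # previous output joined with input
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])   # forget gate, values in (0, 1)
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate, values in (0, 1)
    n_t = np.tanh(params["W_n"] @ z + params["b_n"])   # candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * n_t                     # updated cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])   # output gate
    h_t = o_t * np.tanh(c_t)                           # new hidden state
    return h_t, c_t

# Hypothetical sizes and small random weights, for demonstration only.
rng = np.random.default_rng(0)
hidden, inputs = 3, 4
params = {}
for gate in "fino":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
    params[f"b_{gate}"] = np.zeros(hidden)

h_t, c_t = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), params)
```

In a full network this step is applied to every element of the input sequence, and the final hidden state is passed to the MLP classifier.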
Like RMSprop and Adadelta, the Adam optimizer stores an exponentially decaying average of past gradients (M_T) and of past squared gradients (V_T), as in Eqs. (11)-(13). At time T, the gradient of the stochastic objective is denoted G_T. Here M_T and V_T represent the first and second moments of the gradient, i.e., the mean and the uncentered variance. Because M_T and V_T start at zero, they are biased toward zero, especially during the initial time steps and when the decay is small, i.e., when the decay rates ε_1 and ε_2 are close to 1. Eqs. (14) and (15) offset these biases by computing the bias-corrected first and second moment estimates.
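The moment estimates and their bias correction can be illustrated with one Adam update. This sketch assumes the standard Adam formulation; the decay rates are named beta1 and beta2 here (the text appears to write them as ε_1 and ε_2), and the default values and the sample parameters are assumptions rather than the paper's settings.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at 1-based step t; returns new parameters and moments."""
    m = beta1 * m + (1.0 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical parameters and gradient for a single first step.
theta = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
new_theta, m, v = adam_step(theta, grad, np.zeros(2), np.zeros(2), t=1)
```

At the first step the raw moments are only a small fraction of the gradient, while the corrected estimates recover its full scale, so each parameter moves by roughly the learning rate in the direction opposite to its gradient.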

Performance Evaluation
For the experimental simulation, a Python Jupyter notebook was used on a computer system with a 3.2 GHz Core i5 processor. A single word can have several meanings (senses). To demonstrate the proposed WSD algorithm, the following short paragraph was used as input: "Ginger is a medicinal plant. There is not a particular period to sow this plant but the pre-monsoon shower session is considered a better period. It is considered a Kharif crop. One month of dry weather before harvesting ginger gives better results".
Tab. 3 presents the output of the proposed algorithm in tabular form. The term "session" can refer to a period of activity, a serious meeting, or a weather session. After applying the disambiguation algorithm, the word "session" was related to a weather session only. Fig. 6 shows the other output of the proposed deep learning-based agricultural event extraction. In the next phase, the proposed deep learning method was applied to the unstructured unified corpus to extract agricultural NER, events, and relationships that can be used to predict major crop productivity in the Uttarakhand region. To find better crop production, the main factors considered were soil, season, water, input support facilities, and risk. Other observations include a Mean Squared Error of 0.065, a Root Mean Squared Error of 0.25, a Mean Absolute Error of 0.065, and a Nash-Sutcliffe efficiency coefficient of 0.99.
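For reference, the four error measures reported above can be computed as in the following sketch, which uses their standard definitions. The function name and the sample values are hypothetical and do not reproduce the paper's data or results.

```python
import numpy as np

def regression_metrics(observed, predicted):
    """Return MSE, RMSE, MAE and the Nash-Sutcliffe efficiency coefficient."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    err = obs - pred
    mse = float(np.mean(err ** 2))
    # Nash-Sutcliffe: 1 minus squared error relative to the variance around the observed mean.
    nse = 1.0 - float(np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2))
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(err))),
        "nse": nse,
    }

# Made-up observed and predicted values for illustration.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

A Nash-Sutcliffe coefficient close to 1 indicates that the predictions explain almost all of the variation in the observations, whereas a value of 0 means the model is no better than predicting the observed mean.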

Parameter Metrics
In this study, the performance of the proposed method was assessed using standard statistical performance evaluation criteria, namely accuracy, sensitivity, specificity, and F-measure. Tab. 4 provides the accuracy, F-measure, sensitivity, and specificity values for the major crops of the Uttarakhand region.
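These criteria can be derived from confusion-matrix counts, as in the sketch below. It assumes a binary (one-vs-rest) view of each crop class, which the paper does not state explicitly; the function name and the counts are hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }

# Made-up counts for one crop class treated as the positive class.
scores = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

For a multi-class problem such as rice, barley, and potato, the same computation would be repeated with each crop in turn as the positive class.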

Comparative Analysis
This section provides a detailed description of the performance of the proposed method. The comparison is presented for validation with 80% training and 20% testing data. The proposed method was also analyzed with 70-30% and 60-40% training-testing splits. Fig. 7 shows the accuracy of the proposed method with respect to ANN, recurrent neural network (RNN), LSTM with Adam optimizer, and LSTM with rectified Adam optimizer.
Similarly, Fig. 8 compares the proposed method with ANN, RNN, LSTM with Adam optimizer, and LSTM with rectified Adam optimizer in terms of precision, recall, and F-score. As shown in Tab. 5, the proposed method and existing techniques such as the deep learning-based weighted self-organizing map (DL-SOM) [39], LSTM+RAO [40], PLR-DBN, KNN, and Naïve Bayes were evaluated with combinations of training and testing percentages, such as an 80% training and 20% testing dataset, for rice, barley, and potato.

Conclusion and Future Directions
The proposed method presented a statistical investigation of the rainfall, soil, agriculture, and pest and fertilizer datasets for the Uttarakhand region. The scope of the experiment was to extract agricultural NER, events, and the relationships between them. The method can be used to enhance the productivity of major crops such as rice, barley, and potato in high rainfall areas of Uttarakhand state by determining the rainfall required for good crop production with better soil quality. In this context, a deep learning method was implemented to predict the suitable major crop for the season in the Uttarakhand region, India. The output generated by the introduced method shows better performance than existing methods. An accuracy of 88.10% was achieved by using the LSTM with RAO and MLP. The experimental results were compared with DL-SOM, LSTM+RAO, PLR-DBN, KNN, and Naïve Bayes, and the proposed algorithm outperforms them by 1.09%, 1.32%, 1.0%, 1.37%, and 1.22% in accuracy, by 1.09%, 1.01%, 1.0%, 1.44%, and 1.44% in sensitivity, and by 1.11%, 1.0%, 1.07%, 1.41%, and 1.49% in specificity, respectively. The value of the Nash-Sutcliffe efficiency coefficient was 0.99. The proposed scheme delivered improved sensitivity, accuracy, specificity, and F-score compared with the other approaches available for crop prediction. To further improve agricultural productivity, plant disease datasets can be taken into consideration in future work. The experimental results also show that there is considerable scope for researchers to focus on potato crop productivity in hilly areas.