The rise of social networking has enabled the development of multilingual, Internet-accessible digital documents in several languages. Such documents need to be evaluated through Cross-Language Text Summarization (CLTS), which involves the generation of target documents from disparate source documents. Cross-language document processing generates target documents from disparate-language sources. The digital documents need to be processed with contextual semantic data through a decoding scheme. This paper presents a multilingual cross-language processing model for the abstraction and summarization of documents. The proposed model is the Hidden Markov Model LSTM Reinforcement Learning (HMMlstmRL) model. First, the developed model uses the Hidden Markov model to compute keywords among the cross-language words for clustering. In the second stage, bi-directional long short-term memory networks are used for keyword extraction in the cross-language process. Finally, the proposed HMMlstmRL uses a voting concept in reinforcement learning for the identification and extraction of keywords. The performance of the proposed HMMlstmRL is 2% better than that of the conventional bi-directional LSTM model.
Natural Language Processing (NLP) enables computers to perform tasks on human languages efficiently. It involves processing input provided in a human language and converting the input information into an appropriate representation in another language [
India comprises diverse languages: around 2000 dialects have been identified, with the stipulated use of Hindi and English for official communication with the national government. India has two official languages used by the national government, as well as 22 scheduled languages for administrative purposes [
CLIR involves processing languages based on query languages matched against other languages for the searched documents. This involves estimating the information available in the native languages within CLIR and identifying relevant information about the documents [
To improve the functionality of CLIR, data mining is an effective tool for information processing [
In searches of the World Wide Web, English is considered the primary language, but the number of users is increasing, and non-English native speakers also search for documents. To facilitate ease of search for these users, it is necessary to construct an appropriate domain with machine learning for the identification and translation in the cross-language abstraction and summarization model. The proposed HMMlstmRL model comprises the LSTM model with HMM integrated with reinforcement learning. The specific contributions are presented as follows:
- With the Hidden Markov model, the word count and the number of keywords are considered for analysis and processing. The LSTM model calculates the total number of words in the statement or document; from the calculated number of words, the number of repeated words is updated in the neural network model.
- To develop a bi-directional LSTM-based corpus model for multi-language encoding processing with keyword extraction.
- To construct a reinforcement learning based machine learning model for feature extraction and word identification. Upon the testing and validation of the information, words are processed and updated on the network.
- Within reinforcement learning, the MapReduce framework is applied for the clustering and removal of duplicate words in the text document. Finally, voting is integrated for the abstraction and summarization of the keywords.

The experimental analysis showed that the proposed HMMlstmRL achieves higher precision and recall values compared with conventional techniques.
This paper is organised as follows: Section 2 investigates how the cross-lingual processing model works. In Section 3, research methods are adopted for the encoding and decoding of data in multi-lingual systems with the proposed HMMlstmRL. In Section 4, experimental analysis of the proposed HMMlstmRL model is comparatively examined with existing techniques. Finally, in Section 5, the overall conclusion is presented.
The key challenge for English-to-Hindi statistical machine translation is that Hindi is morphologically richer than English. Two strategies facilitate reasonable performance for this language pair. The first is reordering the English source sentences in accordance with Hindi word order; the second is making use of the suffixes of Hindi words. Either strategy, or both, can be used during translation. The difference in word order between Indian languages and English makes these two strategies equally challenging. For example, the English sentence “he went to the office” has the Hindi translation “vah (he) kaaryaalay (office) gaya (went to).” In this example, it is evident that the position of words in English is not retained in the translated Hindi text. An author has developed an unsupervised part-of-speech tagger that makes use of target-language information, and it has been shown that the results are better compared with the Baum-Welch algorithm [
In [
In [
In [
In [
In [
In [
In [
The proposed HMMlstmRL comprises a coding scheme applied with the Hidden Markov Model for abstraction and summation. Initially, the characters are estimated, and HMMlstmRL performs the coding scheme for the estimation of the characters stored in the computer in the form of bits. The detection of the charset is based on estimation of the Unicode transformation format (UTF-8), due to the presence of longer byte sequences. The bytes are computed based on the implementation of the bytes with a validity test through
The Unicode standard:

- Accommodates more than 65,000 characters.
- Is synchronized with the corresponding versions of ISO 10646.
- Incorporates existing standards such as ISO 6937, the ISO 8859 series, ISCII, KS C 5601, JIS X 0209, JIS X 0212, GB 2312, CNS 11643, etc.
- Covers scripts and characters including the European alphabetic scripts.
The text processing system comprises dependent components based on dictionaries in the targeted language document with assigned character codes. Each element's code is defined as a unique code point, written with a hexadecimal number prefixed by “U+”; for example, U+0041 has the value “A”. The language character level is identified and processed with statistical language identification over the training data in reinforcement learning [
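As an illustration of the code-point convention above, the following is a minimal Python sketch (the function names are ours, introduced for illustration) that maps a character to its U+ label and checks whether a byte sequence is well-formed UTF-8:

```python
def code_point(ch: str) -> str:
    """Return the Unicode code-point label, e.g. 'A' -> 'U+0041'."""
    return f"U+{ord(ch):04X}"

def is_valid_utf8(data: bytes) -> bool:
    """Check whether a byte sequence is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(code_point("A"))                          # U+0041
print(is_valid_utf8("नमस्ते".encode("utf-8")))   # True
print(is_valid_utf8(b"\xff\xfe"))               # False
```
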
The Hidden Markov model has fewer independence assumptions. In particular, the HMM does not assume that the probability that sentence i is in the summary is independent of whether sentence i−1 is in the summary. The syntactic and semantic features are extracted from the source-language text and are used in the transfer phase to generate the sentence in the target language. This information is extracted by the HMM source-language analysis phase, which is further sub-divided into part-of-speech tagging and word sense disambiguation. Once this information is extracted, it is used in the transfer phase, which makes use of a Bayesian approach based on the statistical analysis of existing bilingual parallel corpora. The labels assigned with the HMM model are defined as follows, based on the features.
Feature Name | Label | Value |
---|---|---|
Position of Paragraph | O1 | 1,2,3 |
Number of terms | O2 | log(wi+1) |
Baseline Term Probability | O3 | log(Pr(terms in i|baseline)) |
Document Term Probability | O4 | log(Pr(terms in i|document)) |
The position of a sentence in its paragraph: we assign each sentence a value o1(i) designating it as first in the paragraph (value 1), last in the paragraph (value 3), or an intermediate sentence (value 2). The sentence in a one-sentence paragraph is assigned value 1, and the sentences in a two-sentence paragraph are assigned values 1 and 3. The number of terms in a sentence: the value of this feature is computed as o2(i) = log(number of terms + 1).
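The two observation features can be sketched as follows (a minimal illustration; whitespace tokenization is an assumption on our part):

```python
import math

def position_feature(sentences, i):
    """o1(i): 1 if first in the paragraph, 3 if last, 2 otherwise.
    A one-sentence paragraph gets value 1 (the i == 0 branch)."""
    if i == 0:
        return 1
    if i == len(sentences) - 1:
        return 3
    return 2

def term_count_feature(sentence):
    """o2(i) = log(number of terms + 1), terms split on whitespace."""
    return math.log(len(sentence.split()) + 1)

para = ["First sentence here.", "A middle one.", "The last sentence."]
print([position_feature(para, i) for i in range(len(para))])  # [1, 2, 3]
print(round(term_count_feature(para[0]), 3))
```
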
Content processing with HMMlstmRL through application of the HMM is defined as
The position of a sentence is incorporated into the state structure of the HMM. The number of terms in sentence i is estimated as o2(i) = log(number of terms + 1), and the likelihood of the sentence terms, given the document terms, is o4(i) = log(Pr(terms in sentence i | D)).
The statistical analysis is also performed based on the local syntactic information available in the input sentence. Using the analysed data, the Bayesian approach is applied to predict the probable target word for the given input word. This proposed word sense-based statistical machine translation system may be mathematically expressed as in
In the above
Since the proposed machine translation system works at the word level, there is a need for tokenization of the source text at both the sentence level and the word level. In the Hindi language, tense, aspect, and modality (TAM) information is stored in the affixes of words. These affixes also contribute to the accuracy of machine translation. To extract the TAM information stored in the affixes, a longest-affix-matching algorithm is used to check the matching between affixes, and the Levenshtein distance is used to calculate the matching score. For example, consider the word
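A minimal sketch of the edit-distance computation used for affix matching (the `longest_matching_affix` helper and the affix list in the usage example are illustrative, not the paper's actual affix inventory):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def longest_matching_affix(word: str, affixes: list) -> str:
    """Return the longest affix that ends the word; ties on length are
    broken by the smaller Levenshtein distance to the full word."""
    matches = [a for a in affixes if word.endswith(a)]
    return max(matches, key=lambda a: (len(a), -levenshtein(word, a)), default="")

print(levenshtein("kitten", "sitting"))                       # 3
print(longest_matching_affix("walking", ["ing", "g", "king"]))  # king
```
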
In the LSTM model, the probability value of the document drivers is calculated from the existing languages. The probability is computed based on the occurrence of the string value S over the alphabet X; the sequence is represented in
With the predefined categories, for the n values, the different keywords are assigned a category
The distribution of parameters is based on consideration of the variable n as the number of trials, with p as the probability of occurrence of a character in the document unigram. The multilingual corpus texts for the characters are understood with the character set defined as follows:
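Under the binomial formulation just described, the occurrence probability of a character can be sketched as follows (the corpus string and function names are illustrative assumptions):

```python
from math import comb

def unigram_prob(char: str, corpus: str) -> float:
    """Maximum-likelihood unigram probability of a character in a corpus."""
    return corpus.count(char) / len(corpus)

def binomial_pmf(n: int, k: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p): the probability that a character
    with unigram probability p occurs exactly k times in n positions."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

corpus = "abracadabra"
p = unigram_prob("a", corpus)   # 5 occurrences out of 11 characters
print(round(binomial_pmf(10, 4, p), 4))
```
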
The Hidden Markov model-based MapReduce tagger is a statistical approach used to identify the probable part-of-speech of each word in the sentence. The tagger finds the most probable sequence of parts-of-speech for a given sentence by using the transition probability
The input word sequences are denoted as W1, W2, W3, …, Wn, and the part-of-speech of each word is denoted as T1, T2, T3, …, Tn. The parts-of-speech T1, T2, T3, …, Tn act as hidden states, each of which is predicted using the emission and transition probabilities. For example, consider the input sentence
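Decoding the most probable hidden tag sequence is commonly done with the Viterbi algorithm; a sketch follows, where the toy transition and emission probabilities are invented purely for illustration and are not drawn from any corpus:

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence T1..Tn for words W1..Wn (log-space)."""
    # Initialize with start and emission probabilities for the first word.
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-9)), [t])
          for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            # Best previous state for transitioning into tag t.
            best_prev = max(tags, key=lambda s: V[-1][s][0] + math.log(trans_p[s][t]))
            score = (V[-1][best_prev][0] + math.log(trans_p[best_prev][t])
                     + math.log(emit_p[t].get(w, 1e-9)))
            row[t] = (score, V[-1][best_prev][1] + [t])
        V.append(row)
    return max(V[-1].values())[1]

# Toy, hand-made probabilities (illustrative only):
tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"dog": 0.6, "runs": 0.1}, "V": {"dog": 0.05, "runs": 0.7}}
print(viterbi(["dog", "runs"], tags, start_p, trans_p, emit_p))  # ['N', 'V']
```
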
The frequency-of-occurrence count is found from the monolingual Hindi corpus. Consider that there are 35 occurrences of the word
In a similar manner as in
Hindi Text:
English Equivalent: Drinking water is good for health.
Tamil Text-1:
Tamil Text-2:
Tamil Text-3:
To disambiguate the appropriate sense of a word mentioned in the source text, word sense disambiguation is used.
The identified senses are used in the proposed statistical machine translation approach. The transfer phase predicts the probable target-language word based on the source-language word and its part-of-speech. Using Bayes' rule, the transfer phase is mathematically expressed below in
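A Bayes-rule formulation of such a transfer phase can be written as follows (a sketch under assumed notation: t is the candidate target word, s the source word, and pos its part-of-speech; the final proportionality assumes s and pos are conditionally independent given t, and the denominator is constant over candidate t):

```latex
P(t \mid s, \mathit{pos})
  = \frac{P(s, \mathit{pos} \mid t)\, P(t)}{P(s, \mathit{pos})}
  \;\propto\; P(s \mid t)\, P(\mathit{pos} \mid t)\, P(t)
```

The predicted target word is then the candidate t maximizing this posterior.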
The target word (
In the proposed HMMlstmRL model, the part-of-speech (POS) of the source text is also considered as a parameter to predict the probable alignment for a word. Thus, the probable position of a target word can be calculated using its conditional dependence on the position of the source word, the length of the source text, the length of the target text, and the part-of-speech of the source word. Using Bayes' theorem, the modified word alignment model is mathematically represented below in
In the above
The hidden layer in the MapReduce tagger is activated using a sigmoid function which is expressed mathematically as below,
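As a minimal sketch, the sigmoid (logistic) activation referred to above can be written as:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic activation: 1 / (1 + e^(-x)), mapping R to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # 0.5
```
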
The abstraction and summation phase of the proposed HMMlstmRL is estimated with reinforcement learning defined as in
Using extended Bayes theorem, the above expression (20) is rewritten as,
The HMMlstmRL approach for Hindi-to-Tamil machine translation is compared with the naïve Bayes statistical machine translation system in terms of the features used in both systems. The term-document frequency matrix is constructed using all the above sentences [S1, S2, …, S6] along with the input sentence. Using these 7 sentences, the number of distinct words is identified as 39. Thus, the term-document frequency matrix (A) is of size 39×7 and is shown below,
The cosine similarity between vectors in right singular matrix and first row of singular diagonal matrix is calculated. The resultant vector after applying cosine similarity is as below,
The sentence whose cosine similarity value is nearest to 0 is closest to the input sentence. The first value in the vector denotes the cosine similarity with the input sentence itself, and it is natural that it is closest to zero. The next smallest value in the vector is 0.3341, which is the cosine similarity value of sentence S1.
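The LSA-style ranking described above can be illustrated with a toy example (the matrix below is a small stand-in for the paper's 39×7 term-document matrix; this sketch follows the common convention of ranking candidates by highest raw cosine similarity in the latent space):

```python
import numpy as np

# Hypothetical toy term-document matrix (terms x sentences); column 0
# plays the role of the input sentence, the rest are candidates.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs = Vt.T * s          # document vectors scaled by the singular values
query = docs[0]          # latent vector of the input sentence

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sims = [cosine(query, d) for d in docs]
best = 1 + int(np.argmax(sims[1:]))   # most similar candidate sentence
print([round(x, 3) for x in sims], best)
```
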
This paper concentrated on the evaluation of language keywords through the implementation of a stacked classifier integrated with a voting scheme. This research utilizes machine learning for language estimation and classification, considering four classifiers, namely AdaBoost, artificial neural networks (ANN), decision tree, and support vector machine (SVM), integrated with the voting scheme. The developed HMMlstmRL is adopted through machine learning, which involves several steps: input data, data pre-processing, feature extraction, feature selection, and classification, on which the proposed HMMlstmRL bases its language classification. The steps implemented in the proposed mechanism are presented as follows:
In the proposed HMMlstmRL mechanism, the collected data features are processed through the machine learning stages. The proposed HMMlstmRL scheme uses 4 classifiers with stacking: AdaBoost, ANN, decision tree, and SVM. Initially, language classification is performed with these 4 classifiers. If two classifiers identify a word as a keyword of the language and the other classifiers do not, then the proposed HMMlstmRL evaluates it with the decision tree, which makes the decision on whether the word is a keyword of the language or not. Based on the classification results provided by the decision tree, the network system estimates which language the word belongs to. The classification of the language is based on the voting mechanism.
Initially, the proposed HMMlstmRL trains the model through the classifiers AdaBoost, ANN, SVM, and decision tree. The trained model is exported to the predictor for the computation of the language. The predictors evaluate the prediction vector of each classifier and update the votes. The classification voting approach is based on the score values of the classifiers: if all four classifiers identify a keyword of the language, the word is assigned to that language; if only two classifiers identify it as a keyword of the language, the proposed HMMlstmRL resorts to the decision tree for the computation. To make the decision, the proposed HMMlstmRL utilizes a voting mechanism, which is used for the estimation of languages. The analysis is based on the summation of the classifier values to determine whether a keyword belongs to a language or not. The voting score is estimated from the classifier scores through the following condition:
if score > 2, then keyword language;
else, other language.
Based on the computed classifier values, the voting score is computed. Through the computed voting score, the proposed HMMlstmRL determines whether the data belongs to a particular language or not.
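The voting rule above can be sketched as follows (the function names are ours; the tie-breaking via the decision tree follows the description in the text):

```python
def vote(predictions, tie_breaker=None):
    """predictions: 0/1 outputs of the four classifiers (AdaBoost, ANN,
    SVM, decision tree). Applies the rule: score > 2 -> keyword language,
    else other language; a 2-2 split defers to the decision-tree output."""
    score = sum(predictions)
    if score > 2:
        return "keyword language"
    if score == 2 and tie_breaker is not None:
        # 2-2 split: defer to the decision-tree classifier.
        return "keyword language" if tie_breaker else "other language"
    return "other language"

print(vote([1, 1, 1, 0]))                 # keyword language (score 3)
print(vote([1, 1, 0, 0], tie_breaker=1))  # keyword language (tie broken)
print(vote([1, 0, 0, 0]))                 # other language (score 1)
```
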
The proposed HMMlstmRL comprises the sequence-to-sequence model and was developed using PyTorch with TensorFlow at the backend. TensorFlow is an open-source platform used for developing deep neural networks. The proposed model learns the features from the input vector and the target vector; these features are used to generate the target text based on the input vector fed to it. To make the model learn the features efficiently, a huge corpus is needed. For this reason, a sentence in Tamil and Hindi can be shuffled into different combinations to generate variants of the given sentence. Since Hindi is a partially free word order language, not all the generated combinations will be grammatically correct. Thus, there is a need to verify the grammatical correctness of the text being generated, and parsing the sentences is helpful in checking it. In this proposed approach, HMMlstmRL is used to verify the correctness of the generated Hindi sentences: the Hindi parser verifies the grammar by parsing the tagged text fed to it. In this way, the valid variants of the Hindi text are generated along with their Tamil sentences and are maintained in the training dataset. Similarly, the Tamil sentences can also be shuffled, but there is no need to verify their grammar, due to the fully free word order nature of the language.
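The sentence-shuffling augmentation can be sketched as follows (a minimal illustration; the grammaticality filtering of the Hindi variants, done by the parser in the paper, is omitted here):

```python
from itertools import permutations

def word_order_variants(sentence, limit=10):
    """Generate word-order permutations of a sentence, up to `limit`.
    In the paper, the Hindi variants are then filtered by a parser for
    grammaticality, while Tamil (fully free word order) needs no check."""
    words = sentence.split()
    out = []
    for perm in permutations(words):
        out.append(" ".join(perm))
        if len(out) >= limit:
            break
    return out

variants = word_order_variants("a b c")
print(len(variants))  # 6
```
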
The proposed HMMlstmRL model was analysed with various dropout percentages and an optimal percentage value was found to be in the range of 20% to 60%. Since there is an encoder module and a decoder module, there is a need to analyse the dropout percentage in both these modules such that the performance of the overall system is good. The ideal dropout percentage for encoders is found to be 20%, and the ideal dropout percentage for decoders is 60%. The following are parameters that were used for the sequence-to-sequence model,
Number of epochs = 22
Learning rate = 0.01
Hidden layer size = 2
Dropout = 0.2 (in encoder) and 0.6 (in decoder)
After 22 epochs, the feature learning by model gets saturated and
For the analysis, taking an example from our published work [61], consider the input text as:
Tag | JJ | N_NN | CC_CCD | PR_PRP | PSP | V_VM | V_VAUX | RD_PUNC
---|---|---|---|---|---|---|---|---
Part-Of-Speech | Adjective | Noun | Conjunction | Pronoun | Postposition | Finite Verb | Auxiliary Verb | Punctuation
The tagged output for the text considered will be as below,
The sequence-to-sequence model of the proposed HMMlstmRL was tested with various test sets. The generated target sentences were evaluated using the Bilingual Evaluation Understudy (BLEU) score. The
Training/Testing Corpus Size (in %) | BLEU Score |
---|---|
60/40 | 0.7037 |
70/30 | 0.7234 |
80/20 | 0.7588 |
90/10 | 0.6628 |
The neural machine translation system is also evaluated at different runs by keeping the ratio of training and testing pair as 80:20 as shown in
Number of Runs | BLEU Score |
---|---|
1 | 0.7478 |
2 | 0.7176 |
3 | 0.6784 |
4 | 0.7118 |
5 | 0.7588 |
6 | 0.7124 |
7 | 0.7022 |
8 | 0.6914 |
9 | 0.7156 |
10 | 0.7211 |
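For reference, a simplified sentence-level BLEU can be sketched as follows (uniform n-gram weights with a brevity penalty; real evaluations typically use smoothed, corpus-level BLEU, so this is only an illustrative approximation):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if overlap == 0:
            return 0.0          # unsmoothed: any zero precision gives 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat is on the mat", "the cat is on the mat"), 4))  # 1.0
```
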
The word embedding is performed using a continuous bag-of-words model, which is found to capture the semantics of the words. This in turn helped in improving the accuracy of the translation using the sequence-to-sequence model. Since Hindi and Tamil are morphologically rich languages, there is a need for the semantic mapping made by this approach. The results are found to be far better than those of any state-of-the-art method for these two languages. The BLEU score is found to be 0.7588, and it can be improved further by using a properly aligned parallel corpus, as shown in
Corpus Size (in number of words) | HMMlstmRL Precision (in %) | HMMlstmRL Recall (in %) | Without HMMlstmRL Precision (in %) | Without HMMlstmRL Recall (in %)
---|---|---|---|---
10000 | 74 | 77 | 56 | 51 |
20000 | 83 | 86 | 70 | 73 |
30000 | 91 | 89 | 79 | 82 |
The target word is thus predicted to be “
The generated sentence is compared with the reference sentence and is as shown in
S.No. | Corpus Size | Bi-directional LSTM Precision (in %) | Bi-directional LSTM Recall (in %) | Proposed HMMlstmRL Precision (in %) | Proposed HMMlstmRL Recall (in %)
---|---|---|---|---|---
1 | 10000 | 53 | 52 | 58 | 57 |
2 | 20000 | 64 | 67 | 71 | 72 |
3 | 30000 | 76 | 74.5 | 83 | 79 |
The performance of the bi-directional LSTM seems to be poor compared with the statistical machine translation without a pivot language, which is evident from the graph shown in
Based on the above calculation, it is found that the word
From
S. No. | Corpus Size (in number of words) | Bi-directional HMMlstmRL Precision (in %) | Bi-directional HMMlstmRL Recall (in %) | Proposed HMMlstmRL Precision (in %) | Proposed HMMlstmRL Recall (in %)
---|---|---|---|---|---
1 | 10000 | 68 | 58 | 65 | 63 |
2 | 20000 | 70 | 69 | 76 | 72.5 |
3 | 30000 | 81 | 76 | 87 | 86.5 |
S. No. | Methodology | Source (L) | Target (L) | BLEU Score
---|---|---|---|---
1 | HMM [ | English | Part of Speech | 0.2287
2 | Topic-based coherence model [ | Indonesian | Japanese | 0.1723
3 | Proposed bi-directional HMMlstmRL | Hindi | Tamil | 0.7394
4 | Proposed HMMlstmRL | Hindi | Tamil | 0.7637
In today's multicultural business world, there is an increased use of natural languages for establishing communication in the business environment. As a result of globalisation, machine translation systems are required to aid communication between various organisations; thus, the demand for translation systems for various language pairs has increased. Such systems can also help improve communication between people of different origins. This paper presented the HMMlstmRL model, in which an HMM is integrated with a bi-directional LSTM and a machine learning technique, with the MapReduce framework used for clustering. Finally, the proposed HMMlstmRL model uses voting to decide whether a word belongs to the keywords of a language or not. The analysis of the results showed that the proposed model exhibits higher performance than the existing techniques.
The authors would like to thank the research and development departments of B V Raju Institute of Technology, Vasavi College of Engineering, C.B.I.T and K.S.R.M College of Engineering for supporting this work.