Open Access
ARTICLE
Corpus Augmentation for Improving Neural Machine Translation
Zijian Li1, Chengying Chi1, *, Yunyun Zhan2, *
1 University of Science and Technology Liaoning, Anshan, 114031, China.
2 College of Science & Health, Technological University Dublin, Dublin, D08 X622, Ireland.
* Corresponding Author: Chengying Chi.
Computers, Materials & Continua 2020, 64(1), 637-650. https://doi.org/10.32604/cmc.2020.010265
Received 21 February 2020; Accepted 11 April 2020; Issue published 20 May 2020
Abstract
The translation quality of neural machine translation (NMT) systems depends
largely on the quality of the large-scale bilingual parallel corpora available. Research
shows that NMT performance degrades substantially when resources are limited, and that
a large amount of high-quality bilingual parallel data is needed to train a competitive
translation model. However, not all languages have large-scale, high-quality bilingual
corpora available. In these cases, improving the quality of the existing corpora becomes
the main way to increase the accuracy of NMT output. This paper proposes a method that
improves data quality through data cleaning, data expansion, and other measures,
expanding the data at both the word level and the sentence level and thereby enriching
the bilingual data. A long short-term memory (LSTM) language model is used to keep the
constructed sentences fluent. The method also applies a variety of processing steps to
further improve the quality of the bilingual data. Experiments on three standard test
sets validate the proposed method, with the state-of-the-art fairseq Transformer NMT
system used for training. The results show that the proposed method improves translation
quality: its BLEU score is 2.34 points higher than that of the baseline.
Cite This Article
Z. Li, C. Chi and Y. Zhan, "Corpus augmentation for improving neural machine translation,"
Computers, Materials & Continua, vol. 64, no. 1, pp. 637–650, 2020.