A Generative-Adversarial Approach to Low-Resource Language Translation via Data Augmentation

Authors

  • Linda Zeng, The Harker School

DOI:

https://doi.org/10.47611/jsrhs.v12i4.5664

Keywords:

GAN, Generative, Generative Adversarial, Generative Adversarial Network, LSTM, Low Resource Language, Data Augmentation, Encoder, Generator, Discriminator, Encoder-decoder, Linguistics

Abstract

Language and culture preservation is a serious challenge, both socially and technologically. This paper proposes a novel data augmentation approach to machine translation of low-resource languages. Because low-resource languages, such as Aymara and Quechua, have few available translations that machine learning software can use as reference, machine translation models frequently err when translating to and from them. Models learn the syntactic and lexical patterns underlying translations by processing training data, so an insufficient amount of data prevents them from producing accurate translations. In this paper, I propose a novel application of a generative adversarial network (GAN) to automatically augment low-resource language data. A GAN consists of two competing models: a generator that learns to produce sentences from noise, and a discriminator that learns to judge whether a given sentence is real or generated. My experiments show that even when trained on a very small amount of language data (fewer than 20,000 sentences) in a simulated low-resource setting, such a model can generate original, coherent sentences, such as "ask me that healthy lunch im cooking up" and "my grandfather work harder than your grandfather before." The first of its kind, this application of a GAN is effective in augmenting low-resource language data to improve the accuracy of machine translation, and it provides a reference for future experimentation with GANs in machine translation.
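For illustration, the sketch below shows the generator-discriminator setup the abstract describes, written in Python with Keras. It is a minimal sketch under assumptions: the layer sizes, vocabulary size, sequence length, and training loop are illustrative choices, not the paper's actual implementation.

    # Minimal text-GAN sketch (illustrative; sizes and layers are assumptions).
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    VOCAB_SIZE, SEQ_LEN, NOISE_DIM = 5000, 12, 64

    # Generator: turns a noise vector into a sequence of per-token
    # probability distributions over the vocabulary.
    z = keras.Input(shape=(NOISE_DIM,))
    h = layers.RepeatVector(SEQ_LEN)(z)           # noise -> pseudo-sequence
    h = layers.LSTM(128, return_sequences=True)(h)
    out = layers.TimeDistributed(
        layers.Dense(VOCAB_SIZE, activation="softmax"))(h)
    generator = keras.Model(z, out, name="generator")

    # Discriminator: reads a (soft) one-hot token sequence and scores how
    # likely it is to be a real sentence rather than a generated one.
    x = keras.Input(shape=(SEQ_LEN, VOCAB_SIZE))
    score = layers.Dense(1, activation="sigmoid")(layers.LSTM(128)(x))
    discriminator = keras.Model(x, score, name="discriminator")
    discriminator.compile(optimizer="adam", loss="binary_crossentropy")

    # Classic Keras GAN pattern: freeze the discriminator inside a stacked
    # model so that only the generator updates when the stack is trained
    # (the flag is captured at compile time, so the discriminator still
    # learns through its own train_on_batch calls).
    discriminator.trainable = False
    gan = keras.Model(generator.input, discriminator(generator.output))
    gan.compile(optimizer="adam", loss="binary_crossentropy")

    def train_step(real_onehot, batch=32):
        """One adversarial round: update the discriminator, then the generator."""
        noise = np.random.normal(size=(batch, NOISE_DIM)).astype("float32")
        fake = generator.predict(noise, verbose=0)
        # Discriminator learns: real sentences -> 1, generated sentences -> 0.
        discriminator.train_on_batch(real_onehot[:batch], np.ones((batch, 1)))
        discriminator.train_on_batch(fake, np.zeros((batch, 1)))
        # Generator learns to make the discriminator answer 1 on its output.
        gan.train_on_batch(noise, np.ones((batch, 1)))

In a setup like this, sampling discrete tokens from the generator's softmax outputs would yield candidate sentences that can be added to the parallel training data as augmentation.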

Published

11-30-2023

How to Cite

Zeng, L. (2023). A Generative-Adversarial Approach to Low-Resource Language Translation via Data Augmentation. Journal of Student Research, 12(4). https://doi.org/10.47611/jsrhs.v12i4.5664

Section

HS Research Articles