A Retrieval-Augmented Generation Based Large Language Model Benchmarked On a Novel Dataset

Authors

  • Kieran Pichai, Menlo School

DOI:

https://doi.org/10.47611/jsrhs.v12i4.6213

Keywords:

Machine Learning, Deep Learning, Artificial Intelligence, Large Language Model, Retrieval-Augmented Generation, Amazon Rainforest, Novel Dataset

Abstract

The evolution of natural language processing has seen marked advances, particularly with the advent of models such as BERT, the Transformer, and the GPT variants, along with recent additions such as Bard. This paper investigates the Retrieval-Augmented Generation (RAG) framework, providing insight into its modular design and the impact of its constituent modules on performance. Leveraging a unique dataset gathered from Amazon Rainforest natives and biologists, our research demonstrates the significance of preserving indigenous cultures and biodiversity. The experiment employs a customizable RAG methodology that allows various components, such as the base language model and the similarity-scoring tool, to be interchanged. Findings indicate that while GPT performs slightly better when given context, PaLM exhibits superior performance without context. The results also suggest that models tend to perform optimally when paired with similarity scores from their native platforms. In conclusion, our approach showcases the potential of a modular RAG design for optimizing language models, presenting it as a more advantageous strategy than traditional fine-tuning of large language models.
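To make the modular design concrete, here is a minimal sketch (in Python) of how such an interchangeable pipeline could be wired together. This is an illustration under assumptions, not the paper's actual implementation: the names ModularRAG and cosine_similarity, and the embed/generate callables, are hypothetical, and the embedding and generation backends would be supplied by whichever platform (e.g., OpenAI for GPT or Google for PaLM) is being benchmarked.

# Hypothetical sketch of a modular RAG pipeline in the spirit of the abstract.
# The base language model ("generate") and the similarity-scoring tool
# ("similarity") are both pluggable; none of these names come from the paper.
from typing import Callable, List, Tuple

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # One interchangeable similarity score: cosine of the angle between embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


class ModularRAG:
    # RAG pipeline with a swappable embedder, similarity score, and generator.
    def __init__(
        self,
        embed: Callable[[str], np.ndarray],                 # text -> embedding vector
        similarity: Callable[[np.ndarray, np.ndarray], float],
        generate: Callable[[str], str],                     # prompt -> completion (e.g., a GPT or PaLM API call)
        corpus: List[str],
    ):
        self.embed = embed
        self.similarity = similarity
        self.generate = generate
        self.corpus = corpus
        self.corpus_vecs = [embed(doc) for doc in corpus]   # pre-embed the knowledge base once

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Rank corpus passages by the chosen similarity score and keep the top k.
        q = self.embed(query)
        scored: List[Tuple[float, str]] = sorted(
            ((self.similarity(q, v), doc) for v, doc in zip(self.corpus_vecs, self.corpus)),
            reverse=True,
        )
        return [doc for _, doc in scored[:k]]

    def answer(self, query: str) -> str:
        # Prepend the retrieved context to the question and hand off to the base LLM.
        context = "\n".join(self.retrieve(query))
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return self.generate(prompt)

Swapping the generate callable between GPT and PaLM, or the similarity callable between each platform's native embedding score, is the kind of component-level substitution the abstract describes, with no retraining or fine-tuning of the underlying model.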


References or Bibliography

Bahdanau, D., et al. (2016). End-to-end attention-based large vocabulary speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4945–4949.

Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Radford, A., et al. (2018). Improving language understanding by generative pre-training. OpenAI.

Siriwardhana, S., et al. (2023). Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11, 1–17.

Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Yu, W. (2022). Retrieval-augmented generation across heterogeneous knowledge. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, 52–58.

Published

11-30-2023

How to Cite

Pichai, K. (2023). A Retrieval-Augmented Generation Based Large Language Model Benchmarked On a Novel Dataset. Journal of Student Research, 12(4). https://doi.org/10.47611/jsrhs.v12i4.6213

Issue

Vol. 12 No. 4 (2023)

Section

HS Research Projects