Investigating a Key COVID-19 Question by Using Natural Language Processing on Scientific Publications

Authors

  • Devika Dua, Cedar Creek
  • John Mapes, Louisiana Tech University

DOI:

https://doi.org/10.47611/jsrhs.v11i3.2977

Keywords:

natural language processing, transformer, health informatics, COVID-19, CORD-19

Abstract

The COVID-19 pandemic has posed an unprecedented challenge to public health. Numerous scientific publications on COVID-19 appear daily as researchers work to understand unexplored facets of the disease. The sheer volume of these publications makes it difficult for researchers to quickly find information and evaluate data related to specific COVID-19 queries. Natural Language Processing (NLP), a branch of artificial intelligence, helps sift through this large body of text algorithmically. The purpose of this study is to investigate a key COVID-19 question by using NLP on scientific publications. Using the T5 (Text-To-Text Transfer Transformer) model, we analyzed 740,000 journal abstracts for specific answers to an important COVID-19 question. We evaluated the models in this study using qualitative observations, t-tests (p-values and inferences), and accuracy metrics (precision, recall, and F1 score). As the number of scientific publications increases, our proposed methodology provides an efficient mechanism for performing specific information retrieval for emerging questions, diseases, and related conditions, especially for underrepresented populations.
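
For readers unfamiliar with the accuracy metrics named in the abstract, the following is a minimal sketch of how precision, recall, and F1 score are commonly computed for a binary relevance judgment. It is not the authors' evaluation code: the function name `precision_recall_f1` and the example labels are hypothetical, chosen only to illustrate the standard definitions.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 = relevant."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels (assumption, not data from the study):
# 1 = the extracted answer is relevant to the question, 0 = it is not.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # human judgment
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # model output

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Precision measures how many of the answers flagged as relevant truly are, recall measures how many of the truly relevant answers were found, and F1 is their harmonic mean.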

Author Biography

John Mapes, Louisiana Tech University

Cyber Research Coordinator, Office of Research and Partnerships

Published

08-31-2022

How to Cite

Dua, D., & Mapes, N. (2022). Investigating a Key COVID-19 Question by Using Natural Language Processing on Scientific Publications. Journal of Student Research, 11(3). https://doi.org/10.47611/jsrhs.v11i3.2977

Section

HS Research Projects