Original Research
Exp. Biol. Med.
Sec. Artificial Intelligence/Machine Learning Applications to Biomedical Research
Volume 250 - 2025
doi: 10.3389/ebm.2025.10389
This article is part of the Issue Proceedings of the 10th Annual Conference of the Arkansas Bioinformatics Consortium (AR-BIC) - Real-World Impact of AI
AI-Powered Topic Modeling: Comparing LDA and BERTopic in Analyzing Opioid-Related Cardiovascular Risks in Women
- 1 University of Arkansas at Little Rock, Little Rock, Arkansas, United States
- 2 National Center for Toxicological Research (FDA), Jefferson, Arkansas, United States
- 3 Office of New Drugs, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, United States
Topic modeling is a crucial technique in natural language processing (NLP), enabling the extraction of latent themes from large text corpora. Traditional topic modeling methods, such as Latent Dirichlet Allocation (LDA), have been widely applied in text mining but are limited in their ability to capture semantic relationships within text. BERTopic, introduced in 2022, leverages advances in deep learning and can capture the contextual relationships between words. In this work, we integrated Artificial Intelligence (AI) modules into LDA and BERTopic and provided a comprehensive comparison of the two in analyzing prescription opioid-related cardiovascular risks in women. Opioid use can increase the risk of cardiovascular problems in women, such as arrhythmia and hypotension. A total of 1,837 abstracts were retrieved and downloaded from PubMed as of April 2024 using three Medical Subject Headings (MeSH) terms: "opioid", "cardiovascular", and "women". The MAchine Learning for LanguagE Toolkit (MALLET) was employed to implement LDA, and BioBERT was used for document embedding in BERTopic. The optimal number of topics was 18 for MALLET and 23 for BERTopic. ChatGPT-4-Turbo was integrated to interpret and compare the results. The short descriptions created by ChatGPT for each topic from LDA and BERTopic were highly correlated, and the performance accuracies of LDA and BERTopic were similar as determined by expert manual reviews of the abstracts grouped by their predominant topics. The t-SNE (t-distributed Stochastic Neighbor Embedding) plots showed that the clusters created by BERTopic were more compact and better separated, indicating improved coherence and distinctiveness between topics. Our findings indicated that AI algorithms could augment both traditional and contemporary topic modeling techniques.
In addition, BERTopic includes a built-in interface for ChatGPT-4-Turbo or other large language models for automatic topic interpretation, whereas LDA topics must be interpreted manually and LDA requires special procedures for data pre-processing and stop-word exclusion. Therefore, while LDA remains valuable for large-scale text analysis under resource constraints, AI-assisted BERTopic offers significant advantages in interpretability and semantic coherence for extracting valuable insights from textual data.
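The abstract mentions that abstracts were grouped by their predominant topics for expert review. As a minimal sketch of what such a grouping step could look like, and not the authors' actual pipeline, the Python snippet below assigns each document to the topic with its largest weight. The function name `group_by_predominant_topic`, the document identifiers, and the weights are all hypothetical and chosen only for illustration; in practice the weights would come from the document-topic output of a topic model.

```python
from collections import defaultdict


def group_by_predominant_topic(doc_topic_weights):
    """Map each topic index to the documents whose largest weight falls on it."""
    groups = defaultdict(list)
    for doc_id, weights in doc_topic_weights.items():
        # Index of the largest weight; ties resolve to the lowest topic index.
        predominant = max(range(len(weights)), key=weights.__getitem__)
        groups[predominant].append(doc_id)
    return dict(groups)


# Hypothetical document-topic weights for four abstracts and three topics
# (illustrative values only, not results from the study).
example_weights = {
    "abstract_1": [0.70, 0.20, 0.10],
    "abstract_2": [0.10, 0.15, 0.75],
    "abstract_3": [0.55, 0.30, 0.15],
    "abstract_4": [0.25, 0.60, 0.15],
}

groups = group_by_predominant_topic(example_weights)
for topic in sorted(groups):
    print(topic, groups[topic])
```

Each resulting group could then be handed to a reviewer who checks whether the abstracts in it match the topic's description.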
Keywords: AI, BERTopic, topic modeling, opioid, cardiovascular risks
Received: 28 Sep 2024; Accepted: 16 Jan 2025.
Copyright: © 2025 Ma, Chen, Ge, Rogers, Lyn-Cook, Hong, Tong, Wu and Zou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Ningning Wu, University of Arkansas at Little Rock, Little Rock, 72204, Arkansas, United States
Wen Zou, National Center for Toxicological Research (FDA), Jefferson, 72079, Arkansas, United States
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.