Topic Modeling Empowered by a Deep Learning Framework Integrating BERTopic, XLM-R, and GPT

Nooria Aamir; Ali Raza; Muhammad Waseem Iqbal; Dr. Khalid Hamid; Zaeem Nazir; Ayyan Asif; Samia Hussain; Hafiz Abdul Basit Muhammad

Authors

Nooria Aamir, Department of Computer Science and Information Technology, Superior University Lahore, Lahore, 54000, Pakistan.
Ali Raza, Department of Computer Science and Information Technology, Superior University Lahore, Lahore, 54000, Pakistan.
Muhammad Waseem Iqbal, Department of Computer Science and Information Technology, Superior University Lahore, Lahore, 54000, Pakistan.
Dr. Khalid Hamid, Department of Computer Science and Information Technology, Superior University Lahore, Lahore, 54000, Pakistan.
Zaeem Nazir, Department of Computer Science and Information Technology, Superior University Lahore, Lahore, 54000, Pakistan.
Ayyan Asif, Master of Science in Data Analytics (STEM), Department of Computer Science, New Mexico State University, Las Cruces, NM.
Samia Hussain, Department of Computer Science, UET Lahore, 54000, Pakistan.
Hafiz Abdul Basit Muhammad, Department of Computer Science and Information Technology, Superior University Lahore, Lahore, 54000, Pakistan.

Keywords:

Topic Modeling, Language modeling, XLM-R, GPT

Abstract

Topic modeling identifies hidden themes and patterns in large text collections and enables a thorough investigation of the messages those texts contain. It is an active research area, and text in several languages, including English and Arabic, has already been investigated. However, low-resource languages, including Urdu, need more research. In this study, we propose applying a framework that integrates BERTopic, XLM-R, and GPT to Urdu text. The proposed approach, which includes fine-tuned BERT, XLM-R, and GPT models, aims to capture the contextual nuances and grammatical intricacies of Urdu text. We used existing Urdu textual data and compared the performance of the proposed approach against existing techniques such as LDA and NMF using coherence and diversity measures. The results show that the proposed approach outperforms existing methods, with an average coherence improvement of 0.05 and a diversity score of 0.87. These findings demonstrate the efficacy of the proposed approach in extracting significant topics from Urdu texts, which can assist scholarly comparative studies of Urdu translations. Integrating real-time Urdu topic modeling into social media and news monitoring systems can support trend analysis, misinformation detection, and sentiment-aware content moderation. Another practical application is incorporating topic modeling into Urdu search engines and recommendation systems to improve information retrieval for Urdu-speaking users.
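The abstract reports a diversity score but does not state how it is computed. As an illustration only, the sketch below assumes the commonly used definition of topic diversity: the fraction of unique words among the top-k words of all topics. The function name `topic_diversity`, the `top_k` parameter, and the example topics (English placeholder words, not data from the study) are hypothetical.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Return the fraction of unique words among the top_k words of all topics.

    Assumed definition (not confirmed by the paper): a value near 1.0 means the
    topics share few words; a value near 0.0 means they are highly redundant.
    """
    # Collect the first top_k words of every topic into one flat list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, each listed as its highest-ranked words.
example_topics = [
    ["cricket", "match", "team", "player"],
    ["election", "vote", "party", "team"],
    ["market", "price", "trade", "vote"],
]

score = topic_diversity(example_topics, top_k=4)
print(f"{score:.2f}")  # 10 unique words out of 12 -> 0.83
```

Under this definition, a score of 0.87 would mean that 87% of the top words across all topics are distinct.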
