Open Access

ARTICLE

Analyzing COVID-19 Discourse on Twitter: Text Clustering and Classification Models for Public Health Surveillance

Pakorn Santakij1, Samai Srisuay2,*, Pongporn Punpeng1

1 Department of Information Technology, Lampang Rajabhat University, Lampang, 52100, Thailand
2 Department of Computer Science, Lampang Rajabhat University, Lampang, 52100, Thailand

* Corresponding Author: Samai Srisuay. Email: email

Computer Systems Science and Engineering 2024, 48(3), 665-689. https://doi.org/10.32604/csse.2024.045066

Abstract

Social media has revolutionized the dissemination of real-life information, serving as a robust platform for sharing life events. Twitter, characterized by its brevity and continuous flow of posts, has emerged as a crucial source for public health surveillance, offering valuable insights into public reactions during the COVID-19 pandemic. This study applies a range of machine learning techniques to extract key themes from a dataset of COVID-19 outbreak-related tweets and to classify the tweets. Several topic modeling approaches were used to extract pertinent themes and subsequently to form a dataset for training text classification models. An assessment of coherence metrics revealed that the Gibbs Sampling Dirichlet Mixture Model (GSDMM), which utilizes trigram and bag-of-words (BOW) feature extraction, outperformed Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and a hybrid strategy involving Bidirectional Encoder Representations from Transformers (BERT) combined with LDA and K-means in pinpointing significant themes within the dataset. Among the models assessed for text clustering, the use of LDA, either as a clustering model or for feature extraction combined with BERT for K-means, resulted in higher coherence scores, consistent with human ratings, indicating its efficacy. In particular, LDA, notably in conjunction with trigram representation and BOW, demonstrated superior performance. This underscores the suitability of LDA for topic modeling, given its proficiency in capturing intricate textual relationships. For text classification, Linear Support Vector Classification (LSVC), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network with BiLSTM (CNN-BiLSTM), and BERT performed best, achieving accuracy and weighted F1-scores exceeding 80%. These results significantly surpassed those of other models, such as Multinomial Naive Bayes (MNB), Linear Support Vector Machine (LSVM), and Logistic Regression (LR), which achieved scores in the range of 60% to 70%.
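To make the terms used in the abstract concrete, the following is a minimal, standard-library sketch of two of them: bag-of-words counting over word n-grams up to trigrams, and the support-weighted F1-score reported alongside accuracy. It is an illustration only, not the authors' pipeline; the function names, the whitespace tokenizer, and the toy labels (`vaccine`, `lockdown`, `symptoms`) are assumptions made for this example.

```python
from collections import Counter


def bow_ngrams(text, max_n=3):
    """Count word n-grams (unigrams up to max_n-grams) in one short text.

    Assumption: naive lowercase + whitespace tokenization, for illustration.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Hypothetical topic labels for six tweets (not data from the study).
y_true = ["vaccine", "vaccine", "lockdown", "lockdown", "lockdown", "symptoms"]
y_pred = ["vaccine", "lockdown", "lockdown", "lockdown", "symptoms", "symptoms"]

print(bow_ngrams("stay home stay safe"))
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Weighting each class's F1 by its number of true examples keeps frequent topics from being drowned out by rare ones, which is one reason the weighted variant is commonly reported when topic classes are imbalanced.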


This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.