Machine Learning Approach for COVID-19 Detection on Twitter

Abstract: Social networking services (SNSs) provide massive data that can be a very influential source of information during pandemic outbreaks. This study shows that social media analysis can be used as a crisis detector (e.g., for understanding the sentiment of social media users regarding pandemic outbreaks). The novel coronavirus disease 2019 (COVID-19), commonly known as coronavirus, affected people worldwide in 2020. Streaming Twitter data have revealed the status of the COVID-19 outbreak in the most affected regions. This study focuses on identifying COVID-19 patients from Twitter messages (tweets) without requiring medical records. For this purpose, we propose an intelligent model using traditional machine learning approaches, namely support vector machine (SVM), logistic regression (LR), naïve Bayes (NB), random forest (RF), and decision tree (DT), with term frequency-inverse document frequency (TF-IDF) features to detect the COVID-19 pandemic in tweets. The proposed model classifies tweets into four categories, namely confirmed, death, recovered, and suspected. A benchmark dataset of COVID-19 tweets is developed and can be used in future research. The experiments show that the proposed approach is promising: with the TF-IDF feature extraction technique, the machine learning approaches (i.e., SVM, NB, LR, RF, and DT) achieve overall accuracy, precision, recall, and F1 scores between 70% and 80%, and their confusion matrices are reported.


Introduction
Online social networking services (SNSs), such as blogs, Facebook, Instagram, and microblogging services (e.g., Tumblr and Twitter), are web forums or online platforms used all around the world. Millions of people currently use SNSs to share images and videos, update their status, and post regular comments on various topics. SNSs can also provide massive data that can be a very influential source of information during pandemic outbreaks [1,2]. Early warning of an outbreak can decrease its impact on public health. SNSs can now be used for disease surveillance to monitor the rate of epidemic outbreaks more quickly than health care specialists and health organizations can [2][3][4].
COVID-19 began spreading around the globe at the start of 2020. The disease is contagious and is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a novel human virus that epidemiologists and virologists consider to have originated in bats and to have been transferred to humans through an intermediary host [1,5]. Due to its rapid spread, the COVID-19 outbreak was declared a "Public Health Emergency of International Concern" by the World Health Organization (WHO) on January 30, 2020 [6]. The disease has influenza-like symptoms and can lead to pneumonia, and it has become a major challenge for healthcare professionals in terms of developing systems for diagnosing and monitoring the pandemic. The early detection of COVID-19 is essential for monitoring and tracking its future dissemination. SNSs can be considered a quick detection and monitoring tool for COVID-19 that raises awareness and helps contain the spread of the pandemic.
Information on COVID-19 has not always been promptly circulated by healthcare organizations. In contrast, SNSs have gained great attention for spreading awareness about COVID-19 [5,7,8]. The massive proliferation of COVID-19 has created a strong need for reliable analytical methods to understand information dissemination and crisis formation in social media. Various research studies have used SNS data to examine epidemic outbreaks and monitor healthcare so that healthcare organizations can make informed decisions more rapidly and efficiently [4,9]. Therefore, emphasis is placed on techniques that enable SNSs to track and detect early warnings of pandemic outbreaks in real time [4,10]. Through SNSs, health care practitioners can be informed in time to deliver basic resources for monitoring pandemic outbreaks. Nowadays, people regularly use SNSs to upload images and videos, update their status, and post comments on their health status, particularly during a pandemic in their region. Combined with machine learning approaches, social media analysis (SMA) provides effective information for outbreak tracking and a convenient way of communicating with the public to reduce pandemic outbreaks [10][11][12][13][14][15][16][17].
Artificial intelligence plays an important role in the new wave of approaches to public health, and, from a methodological point of view, machine learning is among its most applicable branches. This study proposes an intelligent model that retrieves text related to COVID-19 from Twitter messages (tweets) using machine learning approaches, namely SVM, NB, LR, RF, and DT [18][19][20][21][22][23], with TF-IDF [24], GloVe [25], and n-grams [26]. The tweets are categorized into four groups, namely confirmed (a tweet about a person with coronavirus), death (a tweet expressing death from COVID-19), recovered (a tweet expressing a person's recovery from COVID-19), and suspected (a tweet expressing COVID-19 symptoms). The main contributions of the proposed work are as follows:
• provide awareness about COVID-19 by identifying the dissemination of the latest information on COVID-19 from online social media to help prevent the spread of the disease;
• automate COVID-19 analysis by detecting the COVID-19 pandemic from SNSs to perform a real-time analysis;
• categorize Twitter messages related to the COVID-19 pandemic into four groups: "confirmed," "death," "recovered," and "suspected;"
• explore traditional machine learning approaches, namely SVM, LR, NB, RF, and DT, for tweet identification using TF-IDF with the n-grams approach (e.g., with unigrams, the spread of COVID-19 is detected using individual words in tweets); and
• build a benchmark dataset for COVID-19 from Twitter messages that will be available online for future research studies.
This study aims to evaluate COVID-19-related tweets about "confirmed," "death," "recovered," and "suspected" patients to analyze the pandemic outbreak through social media analysis. The proposed traditional machine learning-based approach is tested and evaluated on various domains to measure its performance, accuracy, and efficiency (Section 4).
The remainder of this paper is structured as follows: Section 2 provides a brief overview of the related work in the literature; Section 3 describes the approach followed to obtain the experimental results; Section 4 presents the analysis and evaluation; and Section 5 concludes the research and outlines future research directions.

Related Work
Researchers around the globe use several disease detection approaches for the COVID-19 pandemic to make informed decisions and develop appropriate monitoring systems [27][28][29]. Kouzy et al. [30] and Singh et al. [31] proposed intelligent models for the dissemination of information and measurements relevant to COVID-19 using online social media data.
Early detection of outbreaks, public awareness about them, and techniques for monitoring the COVID-19 pandemic are major concerns [32,33]. Kabir et al. [34] presented a method that discovers user sentiment in posts shared by the public on COVID-19 in social media and modeled public opinion using machine learning and topic modeling techniques. They mainly investigated the psychology and actions of the public, which can help in handling the financial and social crises arising from the COVID-19 outbreak and its major side effects.
Hung et al. [1] developed an artificial intelligence-based model to analyze Twitter discussions associated with public sentiment on the COVID-19 pandemic. Khanday et al. [32] developed an effective model for textual clinical data classification using machine learning approaches. They classified clinical textual data into three classes: COVID, severe acute respiratory syndrome, and acute respiratory distress syndrome. In addition, they presented a comparative analysis of machine learning techniques and showed that the multinomial naïve Bayes model outperformed the other models.
Mistrust of social media affects the propagation of disaster information: it changes not only how media are interpreted and shared, but also how individuals and administrations interpret information in crisis circumstances [35]. Mirbabaie et al. [35] used Twitter data to understand the crises created during the COVID-19 pandemic, as well as the potential circumstances, in order to decrease the mistrust of SNS content and promote sense-making in SMA.
Aggarwal et al. [36] developed a multi-criterion decision support system for COVID-19 and validated its results on the COVID-19 dataset from an official government source. Similarly, Yun et al. [37] performed an analysis of COVID-19 screening laboratory data. From plasmid acid and hematology data, they gathered 2510 cases for a cumulative examination for COVID-19 infection detection. They reported results on influenza infections and planned to explore the effect of fecal matter. Based on the 2510 cases, they suggested clinical and medical actions. However, such data could vary from one place to another because immunity and several other physiological factors differ between regions.
SNSs can be efficiently used to classify disease-related information and to influence health campaigns through interventions that improve public health [9]. The literature suggests that SMA can detect early-warning patterns of pandemic outbreaks and thereby reduce the time that passes between onset and detection. To the best of our knowledge, previous studies have not considered the alarming situation of COVID-19 and important features such as the categorization of COVID-19 patients into "confirmed," "death," "recovered," and "suspected" to analyze the pandemic outbreak through SMA. Furthermore, no benchmark dataset on the COVID-19 pandemic that supports the analysis of public sentiment has been made available. This study performs a textual analysis of Twitter data by identifying information from social sensors (i.e., tweets). Specifically, it tracks awareness related to the rapid dissemination of the COVID-19 pandemic. The proposed work focuses on the problem of identifying COVID-19 patients from tweets without requiring medical records. Accordingly, it proposes an intelligent model using traditional machine learning approaches and outlines how such a model can analyze Twitter data in detail to identify and track keyword associations and trends in disaster situations similar to the COVID-19 pandemic.
Fig. 1 illustrates the proposed methodology for detecting the spread of the COVID-19 pandemic in Twitter messages using machine learning techniques. The proposed model incorporates several components: data gathering, preprocessing, data visualization, classification, and result evaluation. The pseudocode for the proposed approach is also presented at the end of this section.
The component details are presented below.

Data Gathering
We used the Twitter streaming application programming interface (API) to retrieve tweets from Twitter [38]. We gathered about 900,000 tweets during the period between May 13, 2020 and September 30, 2020 using the Twitter API. We selected keywords, including #covid-19, #coronavirus, #corona, covid19, and #covid to collect the relevant tweets. Fig. 2 depicts the other most commonly discussed words about COVID-19 found in a COVID-19 corpus.

Data Preprocessing
After the tweets are collected from Twitter, the data are subjected to the following natural language processing (NLP) preprocessing steps [39]:
• eliminating non-English tweets (i.e., only tweets written in English are considered);
• eliminating stop words: stop words, such as "a," "is," "be," and "the," do not convey meaningful information;
• eliminating retweets: meaningful analytics would be affected by redundant (repetitive) tweets;
• eliminating punctuation marks, special characters, and numbers: they do not express an opinion regarding the disease outbreak;
• eliminating URLs or hyperlinks: only the textual content of tweets is considered herein;
• eliminating @mentions: the names of people reported in @mentions are irrelevant to the disease exploration;
• stemming: transforming words into their base or root forms using stemming techniques [40]; and
• tokenizing: breaking a sentence or phrase into tokens, such as words, using Natural Language Toolkit (NLTK) modules [40].
These preprocessing steps were incorporated to enhance the performance of the proposed model and improve the processing speed. After preprocessing, the tweet data were stored in a comma-separated values (CSV) file.
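The cleaning steps listed above can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the pipeline used in this study: the stop-word list is a tiny hand-picked assumption, the tokenizer is a plain whitespace split rather than an NLTK tokenizer, and language filtering and stemming are left out because they would normally rely on external libraries.

```python
import re

# Tiny illustrative stop-word list (an assumption); NLTK ships a fuller one.
STOP_WORDS = {"a", "an", "and", "be", "is", "the", "to", "of", "for", "have", "i"}

def preprocess(tweet):
    """Return a list of cleaned tokens, or None if the tweet is a retweet."""
    if tweet.startswith("RT @"):                 # drop retweets (redundant content)
        return None
    text = re.sub(r"https?://\S+", " ", tweet)   # remove URLs and hyperlinks
    text = re.sub(r"@\w+", " ", text)            # remove @mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # remove punctuation, digits, symbols
    tokens = text.lower().split()                # simple whitespace tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("I have tested positive for #COVID19 @who https://t.co/xyz !!"))
# ['tested', 'positive', 'covid']
```

Note that removing digits turns "COVID19" into "covid", which is one reason the order and scope of the cleaning steps matter in practice.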

Data Annotation and Building a Benchmark Dataset
A total of 3102 sample tweets on COVID-19 are selected for tagging after the preprocessing step. The sample tweets are tagged by three annotators to reduce gaps or bias in the annotation. The annotators categorize the tweets into four groups, namely confirmed, death, recovered, and suspected. For instance, the label confirmed is assigned when a tweet reflects a person infected with COVID-19, whereas the label suspected is assigned when a tweet describes COVID-19 symptoms in a person. In the annotation phase, the tagged tweets are validated by measuring the inter-annotator agreement with Cohen's Kappa test [41]; the agreement is strong (kappa = 0.841) [42]. Tab. 1 shows the representation of tweets with the assigned category.
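To make the agreement measure concrete, the following sketch computes Cohen's kappa for one pair of annotators. It is an illustration under stated assumptions: the two label lists are invented toy data, and the way the pairwise values of the three annotators were combined in this study is not specified, so only the pairwise statistic is shown.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy labels (hypothetical), not taken from the benchmark dataset.
annotator_1 = ["confirmed", "death", "recovered", "suspected", "confirmed", "death"]
annotator_2 = ["confirmed", "death", "recovered", "suspected", "suspected", "death"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.778
```

Here the annotators agree on five of six tweets (observed agreement 0.833), chance agreement is 0.25, and kappa is therefore (0.833 − 0.25) / 0.75 ≈ 0.778.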

Feature Engineering
Machine learning approaches cannot directly process text data. For this purpose, features are retrieved from the preprocessed annotated tweets and transformed into numeric weights. To retrieve the related features, the TF-IDF [24] feature extraction approach is utilized, and unigrams and bigrams are extracted. The feature set is limited to 5000 features for the whole training set (i.e., max_features = 5000). After the appropriate weights are assigned, the numeric feature values are passed to the machine learning approaches for further analysis. The TF-IDF weight is computed as

$$W_{m,n} = tf_{m,n} \times \log\left(\frac{N}{df_m}\right) \quad (1)$$

where $tf_{m,n}$ is the number of occurrences of term $m$ in document $n$, $df_m$ is the number of documents containing $m$, and $N$ is the total number of documents.

Tab. 1: Representation of tweets with the assigned category
No. | Tweet | Category
7 | A 41 year-old, healthy man with a young family just died from COVID. | Death
8 | I felt equally positive after both of my parents recovered from COVID19-knowing that recovering from the disease produces the same immunity as the vaccine. |
9 | I have now officially recovered from #COVID19 and have been cleared to come to work today. |
10 | Many with suspected COVID19 (number not provided) ICU census unknown. | Suspected
11 | This is so utterly sad. COVID claims the life of someone so young, age 38, who just got elected to serve the country. |
12 | Three of my friends had Corona Vaccination and they are down with fever and body aches. | Suspected
Another technique adopted herein for feature extraction is the n-gram [20]. With unigrams (1-grams), each individual word in a tweet is considered when detecting the spread of COVID-19, whereas with bigrams (2-grams), each pair of adjacent words is considered, so that every word is taken together with the one word (N − 1 = 1) that precedes it. Consider the following example tweet: "I have tested positive for COVID-19." The 2-gram formulation transforms this example into "I have," "have tested," "tested positive," "positive for," and "for COVID-19."
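The n-gram extraction and the TF-IDF weighting described in this section can be sketched as follows. This is a minimal pure-Python illustration that uses the plain form of the weight (term count times the logarithm of the total number of documents over the number of documents containing the term); the function names and toy documents are hypothetical. Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values differ from this sketch.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """docs: list of feature lists. Returns one {feature: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each feature.
    df = Counter(f for doc in docs for f in set(doc))
    return [
        {f: tf * math.log(n_docs / df[f]) for f, tf in Counter(doc).items()}
        for doc in docs
    ]

tokens = "I have tested positive for COVID-19".split()
print(ngrams(tokens, 2))
# ['I have', 'have tested', 'tested positive', 'positive for', 'for COVID-19']

# Toy corpus (hypothetical): a feature present in every document gets weight 0.
weights = tfidf([["covid", "positive"], ["covid", "recovered"]])
print(weights[0])
```

In the toy corpus, "covid" occurs in both documents and therefore receives a weight of zero, while "positive" and "recovered" each receive log(2), which shows how the weighting favors terms that discriminate between tweets.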

Data Splitting
A random split approach is adopted to divide the data into training and testing sets. In random splitting, a pre-specified proportion of the data set is randomly assigned to the training and testing samples. In this work, an 80:20 split was used: 80% of the randomly selected samples were used to train the model, and the remaining 20% were held out to test the model performance using the performance evaluation metrics. Compared to the other approaches, the random split was more stable because the data set was split more appropriately.
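A random 80:20 split of this kind can be sketched with the standard library alone. The function name and the fixed seed are assumptions made for reproducibility of the illustration; the seed used in this study is not reported.

```python
import random

def random_split(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 3102 annotated tweets, represented here only by their indices.
train, test = random_split(range(3102))
print(len(train), len(test))  # 2481 621
```

Every sample ends up in exactly one of the two parts, so the test set stays unseen during training.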

Machine Learning Approaches
Different machine learning approaches are used to detect COVID-19 tweets and classify them into four categories (i.e., confirmed, death, recovered, and suspected). In this work, LR, SVM, NB, DT, and RF are employed to validate the proposed objectives.

Support Vector Machine
SVM is a machine learning approach most commonly used for classification tasks [18]. The SVM operates by finding a decision boundary, often called a hyperplane, that separates the data set into groups such that the vectors on each side of the boundary belong to a specific class. It is mathematically defined as follows. Suppose that vector X = (x, y) and W = (a, −1). The hyperplane can then be written in vector form as

$$w \cdot x + b = 0$$

where x denotes the input features, w is the weight vector, and b is a bias term.
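The role of the hyperplane can be illustrated with a small sketch that evaluates on which side of a linear boundary a point falls. The weights below are hand-picked assumptions (a = 2 and b = 1, so the boundary is the line y = 2x + 1); an actual SVM would learn them from the training data by maximizing the margin, and a four-class problem would additionally need a scheme such as one-vs-rest.

```python
def decision_value(w, x, b):
    """Signed score w . x + b; zero means the point lies on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_side(w, x, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1

w, b = (2.0, -1.0), 1.0              # W = (a, -1) with a = 2 (assumed values)
print(decision_value(w, (1.0, 3.0), b))  # 0.0, the point is on the line y = 2x + 1
print(predict_side(w, (1.0, 0.0), b))    # 1
print(predict_side(w, (0.0, 5.0), b))    # -1
```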

Naïve Bayes
NB [22] is a probabilistic supervised learning model based on Bayes' theorem. The fundamental concept of the NB method is to calculate the probabilities of the categories assigned to the corpus and use them to classify the test data. Bayes' theorem computes the posterior probability p(c|x) from p(c), p(x), and p(x|c) as follows:

$$p(c \mid x) = \frac{p(x \mid c)\, p(c)}{p(x)}$$

where, under the naïve independence assumption, $p(c \mid x) \propto p(x_1 \mid c) \cdot p(x_2 \mid c) \cdot p(x_3 \mid c) \cdots p(x_n \mid c) \cdot p(c)$. Here, p(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes), and p(c) is the prior probability of the class. The probability p(x|c) is the likelihood of the predictor given the class, and p(x) is the prior probability of the predictor. In the proposed work, the multinomial variant of NB (MultinomialNB), which is commonly used for text classification, is employed in the training process.
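A compact multinomial naïve Bayes classifier over word counts can be sketched as below. This is an illustrative stand-in, not the implementation used in this study: the function names and the four toy training tweets are hypothetical, add-one (Laplace) smoothing is assumed, and log-probabilities are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class frequencies, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total_docs)   # log prior p(c)
        for t in tokens:
            if t in vocab:                               # skip unseen words
                # Laplace-smoothed log likelihood p(t | c)
                score += math.log((word_counts[c][t] + 1) / (total_words + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training data (hypothetical), one tweet per category.
docs = [["tested", "positive"], ["passed", "away"], ["fully", "recovered"], ["fever", "cough"]]
labels = ["confirmed", "death", "recovered", "suspected"]
model = train_nb(docs, labels)
print(predict_nb(model, ["recovered", "today"]))  # recovered
```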

Logistic Regression
LR [21] is a commonly used supervised method for predicting a categorical variable from independent variables. For instance, consider a situation where it is required to classify whether or not a person is infected with COVID-19. If linear regression is used in this scenario, a threshold value must be set on which the classification is performed. Suppose that the true class is positive (confirmed in our case), the threshold value is 0.5, and the predicted value is 0.4. The feature vector would then be classified as COVID-19 negative, which could have severe consequences in practice. LR overcomes this limitation of linear regression because its output ranges from 0 to 1. It can be mathematically denoted as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = w x + b$$

where b is a bias term, w is the weight value, and x denotes the continuous input values (e.g., the number of words in a tweet in our case). The function produces an output in the range between 0 and 1, which is used to classify the data into the four categories.
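The squashing of a linear score into the range between 0 and 1 can be sketched as follows. The weights in the example are arbitrary assumptions rather than learned values. The softmax function is included as one common way of extending LR to four classes; whether this study used softmax or a one-vs-rest scheme is not stated, so that part is an assumption.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x, b):
    """Probability of the positive class for one feature vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    m = max(scores)                         # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))                               # 0.5
print(predict_proba((0.8, -0.4), (1.0, 2.0), 0.1))
print(softmax([2.0, 0.5, 0.1, 1.0]))              # largest score, largest probability
```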

Decision Tree
DT [23] is a simple model used for classification problems. It is a supervised learning model in which data are separated based on certain features. DT classifies the data by sorting them down the tree from the root node to a terminal node, which provides the class of the data. Each node in the tree serves as a test on a certain attribute, and each edge descending from the node corresponds to one of the possible outcomes of that test. This mechanism is repeated for each subtree rooted at the new node. In the first phase, the entropy of the data set and the entropy for each attribute are determined. The information gain (IG) is then determined for all the attributes as defined in the following equations, and this procedure is repeated until all attributes are placed in the tree:

$$E(T) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

$$IG(T, x) = E(T) - E(T, x)$$

where x represents the input attribute, T is the current state, $p_i$ is the proportion of samples in T that belong to class i, and E(T, x) is the entropy of T after it is split on x. DT employs different techniques to decide whether a node is divided into two or more sub-nodes. The formation of sub-nodes increases the homogeneity of the resulting sub-nodes; in other words, the purity of the node with respect to the target variable increases. The DT splits the nodes on all available attributes and selects the split that results in the most homogeneous sub-nodes.
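The entropy and information gain computations that drive the choice of a split can be sketched as below. The sketch assumes the standard Shannon entropy with base-2 logarithms; the toy labels and the binary feature (whether a tweet contains a given word) are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy reduction obtained by splitting the labels on a feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the groups produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["death", "death", "recovered", "recovered"]
contains_died = [1, 1, 0, 0]     # perfectly separates the two classes
contains_covid = [1, 0, 1, 0]    # carries no information about the class
print(information_gain(labels, contains_died))   # 1.0
print(information_gain(labels, contains_covid))  # 0.0
```

A tree builder would evaluate every candidate attribute in this way and split on the one with the highest gain.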

Random Forest
RF [19,43] is a traditional machine learning model based on an ensemble of trees: it comprises a large number of DTs that operate together. Each DT is built from a randomly chosen subset of the training set, and the votes of the individual DTs are aggregated to decide the class of a test sample. RF uses the Gini index as an attribute selection measure that quantifies the impurity of an attribute with respect to the classes. For a certain training set x, one sample is randomly picked and said to belong to some class $c_i$. The Gini index is defined as:

$$\sum \sum_{j \neq i} \left( \frac{f(c_i, x)}{|x|} \right) \left( \frac{f(c_j, x)}{|x|} \right)$$

where $f(c_i, x)/|x|$ is the probability that the selected sample belongs to class $c_i$. Thus, x represents the input values, and c is the targeted category.
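Two ingredients of a random forest, the Gini impurity of a set of labels and the majority vote over individual trees, can be sketched as follows. The sketch uses the identity that the sum of products of distinct class probabilities equals one minus the sum of squared class probabilities; the function names and toy inputs are hypothetical, and building the trees themselves is outside the scope of this illustration.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 0 for a pure set, higher for more mixed sets."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

print(gini_index(["death", "death", "death"]))                         # 0.0
print(gini_index(["confirmed", "death", "recovered", "suspected"]))    # 0.75
print(majority_vote(["death", "death", "suspected"]))                  # death
```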

Experiments and Results
This section presents the experimental results for the proposed approach. The empirical analysis was conducted using the Anaconda distribution (Python 3.8) [44] with the open-source Python modules Scikit-Learn [45], NumPy [46], and Keras [47]. The performance of the proposed approach was evaluated using these modules.
The proposed approach was trained using machine learning approaches. The performance of each approach was evaluated on the test set by utilizing performance evaluation metrics [48]. Moreover, the performance of each model was graphically visualized by making a confusion matrix. A confusion matrix is a suitable approach for demonstrating the results in supervised learning problems because it reflects the output of the classification models on the testing set and attempts to evaluate the predicted (detected) dataset as per their true class label.
The results show that the SVM model achieved slightly better results than the other models. The NB classifier also performed well, as illustrated in the figures and tables. The slight improvement could be related to the length of the tweet texts in our dataset. Tab. 2 only considers the classifiers that obtained the highest performance with the n-gram approaches. Moreover, confusion matrices were generated for the selected approaches (i.e., SVM (Fig. 3), NB (Fig. 4), LR (Fig. 5), DT (Fig. 6), and RF (Fig. 7)). The figures show that 77% of the confirmed tweets were detected as confirmed, 76% of the suspected tweets were detected as suspected, 70% of the death tweets were detected as death, and 74% of the recovered tweets were detected as recovered. These are not the best possible detections, but they provide a good baseline or benchmark for improved approaches using deep learning techniques.
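The evaluation metrics reported in this section can be derived from a confusion matrix as sketched below. The label lists are invented toy data, not results from the experiments, and the function names are hypothetical; rows of the matrix correspond to true classes and columns to predicted classes.

```python
from collections import Counter

CLASSES = ["confirmed", "death", "recovered", "suspected"]

def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def per_class_metrics(matrix, classes=CLASSES):
    """Return {class: (precision, recall, f1)} from a confusion matrix."""
    metrics = {}
    for i, c in enumerate(classes):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column total
        actual = sum(matrix[i])                     # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics

def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

# Toy labels (hypothetical).
y_true = ["confirmed", "confirmed", "death", "recovered", "suspected", "suspected"]
y_pred = ["confirmed", "suspected", "death", "recovered", "suspected", "suspected"]
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(per_class_metrics(cm)["confirmed"])
print(accuracy(cm))
```

In this toy example, one confirmed tweet is predicted as suspected, so recall for the confirmed class is 0.5 while its precision stays at 1.0, which is the kind of per-class detail that a confusion matrix exposes and a single accuracy figure hides.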

Conclusion
Artificial intelligence plays an important role in the new wave of approaches to public health, and, from a methodological point of view, machine learning is among its most applicable branches. This study analyzed the problem of identifying COVID-19 patients from Twitter messages without requiring medical records. The resulting framework can be used as a surveillance system for observing the COVID-19 pandemic in real time. The experimental setup, results, and evaluation of the proposed approach were presented to show how COVID-19-infected people can be detected on microblogging services and to validate the proposed objectives.
The proposed traditional machine learning-based model classifies Twitter messages into four categories (i.e., confirmed, death, recovered, and suspected). For this purpose, a novel dataset was collected using the Twitter streaming API to build a benchmark dataset of COVID-19 tweets that can be used in future research studies. The work also visualized the data graphically to understand the data attributes; the visualization revealed the most frequently occurring keywords in the dataset. For the experimental analysis, Twitter data on the COVID-19 pandemic were analyzed to evaluate the traditional machine learning approaches. The results of the proposed method were obtained using SVM, LR, NB, RF, and DT with the TF-IDF feature extraction technique. The performance of the proposed approach was evaluated using accuracy, precision, recall, F1 score, and confusion matrices, and the results were then visualized graphically.
In the future, we aim to improve the performance of the proposed approach with deep learning approaches to analyze the novel coronavirus and the COVID-19 pandemic outbreak.
Funding Statement: This work has been supported by a grant from the Research Center of the Female Scientific and Medical Colleges, Deanship of Scientific Research, King Saud University.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.