|Computer Systems Science & Engineering |
An Ensemble Based Approach for Sentiment Classification in Asian Regional Language
1Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, 431004, India
2Division of Applied Mathematics, Wonkwang University, 460, Iksan-daero, Iksan-Si, Jeonbuk, 54538, Korea
3Department of Mathematics, Tamralipta Mahavidyalaya, Tamluk, West Bengal, 721636, India
4Department of Electronics and Telecommunication Engineering, AISSMSCOE, Pune, Maharashtra, 411001, India
5Department of Information Technology, Government College of Engineering, Aurangabad, Maharashtra, 431005, India
6School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, 600127, India
*Corresponding Author: Jeong Gon Lee. Email: firstname.lastname@example.org
Received: 30 January 2022; Accepted: 23 March 2022
Abstract: In today’s digital world, millions of individuals are linked to one another via the Internet and social media, which opens up new avenues for exchanging information. Sentiment analysis (SA) has received considerable attention during the last decade. In this study, we analyse the challenges of SA in Marathi, one of the Asian regional languages, by providing a benchmark setup in which we first produced an annotated dataset of Marathi text acquired from microblogging websites such as Twitter. Domain experts manually annotated the Marathi microblogging posts with positive, negative, and neutral polarity. In addition, to demonstrate the use of the annotated dataset, an ensemble-based model for sentiment analysis was created. Compared with the other machine learning classifiers, the ensemble classifier achieved better performance with 10-fold cross-validation (cv), with an accuracy of 97.77% and an f-score of 97.89%.
Keywords: Sentiment analysis; machine learning; lexical resource; ensemble classifier
1 Introduction
In this digital age, millions of people are connected to one another through Web 2.0 and social networking, which provides a new way of exchanging knowledge with other people. Social networking sites, e-commerce websites, blogs, and other similar platforms allow users to instantly generate creative content, thoughts, and opinions, leading to huge amounts of data being produced every day. Sentiment analysis and opinion mining have grown into a challenging and dynamic field of research for both resourced and under-resourced languages. The term sentiment refers to a broad concept that encompasses sentiment, evaluation, appraisal, or attitude toward a piece of information that demonstrates the author’s point of view.
Sentiment analysis is also referred to as opinion mining or emotional intelligence. It is the systematic process of extracting useful knowledge from unstructured and unorganized text in various social platforms and online sources, such as chats on Twitter, WhatsApp, and Facebook, as well as online blogs and comments. Opinion mining involves building automated systems that apply machine learning methods to this task.
The number of Marathi internet users and the amount of Marathi web content have grown tremendously. Because Marathi is still an under-resourced language in the field of sentiment analysis, there have been few attempts to perform SA in Marathi. Users express their opinions in a variety of ways, including bilingual text, transliterated words, emoticons, spelling variations, incorrect linguistic structures, and many others. This makes sentiment analysis a difficult field of research, particularly for Indian languages, and it creates an opportunity to develop Marathi resources and research in the field of sentiment analysis.
The major contributions of this research work are the development and evaluation of lexical resources for sentiment analysis in Marathi. Few lexical resources, libraries, and corpora are available for Marathi, which indicates that the language has been little explored in the field of sentiment analysis. In this research, we present an ensemble-based model that predicts the sentiment of Marathi texts by integrating the outputs of machine learning-based models. To develop a benchmark dataset, we manually annotated the Twitter dataset with the help of human annotators (domain experts) who are senior researchers in Marathi, producing an annotated dataset of Marathi tweets with positive, negative, and neutral sentiment orientations. We used Fleiss’ kappa as the metric to measure agreement among the annotators. Lastly, all classification algorithms are evaluated and discussed.
2 Related Work
In recent years, only a few Indian languages have been studied, including Hindi, Telugu, Tamil, and Bengali. However, as digital literacy in India grows and technology makes it easier to create content in Indian languages, more regional-language content will appear on the Internet.
An ensemble-based model has been proposed for sentiment analysis of Persian text [2–4]. Sentiment analysis is performed using deep learning and shallow approaches, and the accuracy achieved in experimentation is up to 79.68%. Researchers proposed an ensemble-based recommender system for hotel reviews and also categorized aspects. They used an ensemble of binary classifiers based on the BERT technique, with Word2Vec, subjectivity score, and Term Frequency-Inverse Document Frequency (TF-IDF) as features, and achieved an f-score of 84% and an accuracy of 93.26%. In another proposed ensemble model, the author considered Information Gain (IG), Gini Index, and Chi-Square for feature extraction, used Sequential Minimal Optimization (SMO), Multinomial Naïve Bayes (MNB), and Random Forest (RF) as machine learning algorithms, and considered a multi-domain dataset.
Researchers have studied the use of Naive Bayes (NB) and Support Vector Machine (SVM) for machine learning-based sentiment classification of movie reviews [8–12]. In these studies, sentiment analysis is treated as a two-class classification problem comprising positive and negative classes. This kind of study may be used to classify textual information, and feature selection affects classifier performance.
For multi-class sentiment analysis, authors have proposed an enhanced one-vs-one (OVO) technique in which the weight of each binary classifier is computed from the K nearest neighbours and the class centre of each category in the training sample set. The information gain (IG) approach is used to identify the key features, and a binary SVM classifier is then trained on the extracted features for every pair of sentiment categories. Ensemble approaches, as an alternative to using each individual learning algorithm alone, employ many learning algorithms to achieve better performance. The performance of deep learning techniques can be improved by combining them with standard approaches based on manually acquired features.
Machine learning-based techniques have played a significant role in Natural Language Processing. Machine learning techniques are divided into two classes: supervised and unsupervised learning. For the task of sentiment analysis, supervised algorithms such as Support Vector Machine (SVM), Maximum Entropy, and Naïve Bayes (NB) are mostly preferred [17–19]. This work includes feature-based sentiment analysis and summarization.
3 System Development
This section describes the corpus creation process, pre-processing, manual annotation, and the evaluation of the human annotators with the help of Fleiss’ kappa. It also presents the proposed ensemble-based model for sentiment classification.
3.1 Corpus Creation
3.1.1 Corpus Acquisition
We extracted publicly available Marathi tweets from Twitter with the help of the Twitter API. Initially, we collected 1493 general Marathi tweets.
3.1.2 Data Preprocessing
First, the data were pre-processed into the necessary form by carrying out the following steps:
• Identified and eliminated duplicate and irrelevant tweets manually.
• Identified and transliterated English words present in tweets into Marathi manually.
• Removed stop words.
• Performed lemmatization to find the root word.
• Removed any incorrect punctuation, smileys, hashtags, or photo tags.
• Removed complicated sentences since they are inappropriate for performing sentiment analysis.
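The automatable parts of the steps above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the stop-word set is a tiny hypothetical sample, duplicates are detected by exact match after cleaning, and the manual steps (transliteration of English words, removal of irrelevant or complicated tweets) as well as lemmatization are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word sample (assumption for illustration).
STOP_WORDS = {"आहे", "आणि", "हा", "ही", "हे"}

URL_RE = re.compile(r"https?://\S+")        # links
TAG_RE = re.compile(r"[@#]\S+")             # hashtags and user/photo tags
DANDA_RE = re.compile(r"[\u0964\u0965]")    # Devanagari danda punctuation
# Anything that is not Devanagari, an ASCII letter or digit, or whitespace
# (this also drops emoji and ASCII punctuation).
NOISE_RE = re.compile(r"[^\u0900-\u097FA-Za-z0-9\s]")

def clean_tweet(text):
    """Strip links, tags, punctuation, and stop words from one tweet."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = DANDA_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def preprocess(tweets):
    """Clean every tweet and drop empty results and exact duplicates."""
    seen, cleaned = set(), []
    for tweet in tweets:
        c = clean_tweet(tweet)
        if c and c not in seen:
            seen.add(c)
            cleaned.append(c)
    return cleaned
```

English words are intentionally kept by `clean_tweet` so that they remain available for manual transliteration.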
3.1.3 Data Annotation
We chose three domain experts, senior scholars with a Ph.D. in Marathi, to perform the manual data annotation. We asked them to tag the Marathi tweets dataset with 1, 0, and −1 to represent the positivity, neutrality, and negativity of each tweet.
3.2 Feature Extraction
Supervised machine learning methods generate output for test data by learning from a pre-defined set of features in the training samples. Because machine learning methods cannot work directly on raw text, feature extraction methods are required to transform text into a vector of features. In this research work we use unigrams with Term Frequency–Inverse Document Frequency (TF-IDF) for feature extraction. Unigrams, i.e., single words, mostly carry the important opinions and emotions. For example, in “Camera of this mobile is good”, the word “good” expresses an opinion about the camera. It is therefore important to consider the unigram + TF-IDF model for feature extraction.
The unigram word vectors obtained during the initial stage are used to build a matrix containing all of the tweets, and the unigrams recovered from the matrix are handled as features. The TF-IDF feature matrix is constructed with the features as columns and the tweets as rows. The lexical TF-IDF is calculated by multiplying each feature column of the TF-IDF feature matrix by its sentiment score. This matrix is used to train the supervised machine learning algorithms.
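A minimal sketch of this construction is given below, using only the Python standard library. The exact TF-IDF variant is not specified in this work, so the formula (relative term frequency times `log(N / df) + 1`) is an assumption, as is the choice to leave unigrams that are missing from the sentiment lexicon unscaled (weight 1.0). The function names and the `sentiment_score` mapping are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows): one row per tweet, one column per unigram."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of tweets containing each unigram.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        total = len(doc) or 1
        rows.append([tf[t] / total * idf[t] for t in vocab])
    return vocab, rows

def lexical_tfidf(vocab, rows, sentiment_score):
    """Scale each feature column by the sentiment score of its unigram."""
    # Assumption: unigrams absent from the lexicon are left unscaled.
    weights = [sentiment_score.get(t, 1.0) for t in vocab]
    return [[v * w for v, w in zip(row, weights)] for row in rows]
```

For example, `tfidf_matrix(["camera good", "battery bad", "camera bad"])` yields a 3 × 4 matrix in which the rarer unigram "good" receives a higher weight than "camera" in the first row.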
3.3 Sentiment Classification Approach
Machine learning algorithms learn from a training set and then classify unseen inputs. The training set contains the input feature vectors and their class labels. Using this training set, a classification model was created to classify the input material into positive and negative classes. The extracted feature sets are used to train the classifier to evaluate whether a review in the dataset is positive or negative. Ensemble techniques are a type of machine learning methodology that integrates numerous base models to create a single best prediction model.
3.3.1 Logistic Regression (LR)
Logistic regression estimates probabilities using a logistic function, which is the cumulative logistic distribution, to assess the association between a categorical dependent variable and one or more independent variables [24–28]. Logistic regression is a linear approach; however, the logistic function is used to modify the predictions. It is a statistical technique for assessing a dataset that has one or more independent variables that affect the outcomes.
Instead of fitting a regression line, we fit an S-shaped logistic function whose output is bounded by the two extreme values 0 and 1. Logistic regression starts with a conventional linear regression and then applies a sigmoid to the linear regression result. The regression is expressed in Eq. (1) and the logistic function in Eq. (2).
$$z = w_0 + w_1 x_1 + \dots + w_n x_n \qquad (1)$$
$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2)$$
where $w_0, \dots, w_n$ indicate the weights and $x_1, \dots, x_n$ represent the independent variables.
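The two steps described above, a linear score followed by a sigmoid, can be sketched as follows. The function names are illustrative, and the weights are assumed to be already fitted.

```python
import math

def linear_score(weights, bias, x):
    """Linear regression part: bias (w0) plus the weighted sum of the inputs."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(linear_score(weights, bias, x))
```

A score of zero maps to a probability of 0.5, which is the usual decision threshold between the two classes.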
3.3.2 Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a straightforward but highly efficient method for fitting linear classifiers and regressors under convex loss functions. SGD has been applied effectively to large-scale and sparse machine learning problems, such as text categorization and NLP. Given the sparsity of the data, the classifiers in this module can efficiently scale to problems with more than 10^5 training instances and more than 10^5 features. The class SGDClassifier provides a simple stochastic gradient descent learning process that supports various classification loss functions and penalties. An SGDClassifier trained with the hinge loss has a decision boundary comparable to that of a linear SVM.
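To make the learning process concrete, the following is a minimal pure-Python sketch of stochastic gradient descent with the hinge loss and an L2 penalty for a two-class problem with labels −1 and +1. It is not the library implementation; the learning rate, penalty strength, and number of epochs are arbitrary assumptions, and a three-class task would additionally need a scheme such as one-vs-rest.

```python
import random

def sgd_hinge_fit(X, y, epochs=50, lr=0.1, alpha=0.01, seed=0):
    """Fit a linear classifier one sample at a time.

    X: list of feature lists; y: labels in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # Gradient step for the L2 penalty (shrinks the weights slightly).
            w = [wj * (1.0 - lr * alpha) for wj in w]
            # Hinge loss is non-zero only when the margin is below 1.
            if y[i] * score < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def sgd_predict(w, b, x):
    """Return +1 or -1 according to the side of the decision boundary."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Because each update uses a single sample, the cost per step does not depend on the size of the training set, which is why the method scales to large sparse text data.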
3.3.3 Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a well-known supervised machine learning model for categorization and prediction of different datasets. Several studies claim that SVM is a fairly accurate approach for text categorization. It is also often used in sentiment analysis.
For example, if we have a dataset that has been pre-labelled into two categories, positive and negative, as shown in Fig. 1, we may train a model to classify real-time data into these two categories. This is precisely how SVM operates. We train the model on a dataset so that it can evaluate and classify unknown data into the categories that were present in the training set.
3.3.4 Naive Bayes (NB)
The Naive Bayes classifier is a prominent supervised classifier that can label content as expressing positive, negative, or neutral sentiment. To classify words into their respective categories, the Naive Bayes classifier employs conditional probability. The advantage of using Naive Bayes for text classification is that it requires only a small dataset for training. The raw data is pre-processed by removing stop words, punctuation marks, extra spaces, and special symbols, and by transliterating words from other languages. Human annotators perform the manual tagging with positive, negative, and neutral labels.
The classifier is useful for determining the likelihood of each sentiment for a given statement. In this technique, each attribute contributes to selecting which label should be assigned as the sentiment value of each phrase. The Naive Bayes classifier starts by computing the prior probability of each label, which is derived by examining the occurrence of each labelled statement in the training dataset. Eq. (3) describes Bayes' rule.
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (3)$$
where A is a particular class, B is the sentence that needs to be classified, P(A) and P(B) are prior probabilities, P(B | A) is the likelihood of the sentence given the class, and P(A | B) is the posterior probability of the class given the sentence.
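As an illustration of how Bayes' rule is applied to text, the sketch below implements a small multinomial Naive Bayes classifier over unigrams. Add-one (Laplace) smoothing and the skipping of unseen words are assumptions of this sketch, not details taken from this work. The denominator P(B) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count class frequencies and per-class unigram frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest log prior plus log likelihood."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, count in class_counts.items():
        logp = math.log(count / n_docs)               # log prior P(A)
        total = sum(word_counts[label].values())
        for token in doc.split():
            if token in vocab:                        # skip unseen words
                # Add-one smoothed likelihood of the token given the class.
                logp += math.log((word_counts[label][token] + 1)
                                 / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.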
3.3.5 K-Nearest Neighbour (KNN)
K-Nearest Neighbours (KNN) is an important classification technique in machine learning. It is a supervised learning algorithm that is widely used in text classification. It is extensively applicable in real-world circumstances since it is non-parametric, which means it makes no underlying assumptions regarding the data distribution. We are given some prior data (also known as training data) in which each point is already assigned to a category; a new point is then assigned to the category that is most common among its k nearest training points.
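A minimal sketch of this decision rule is shown below. Euclidean distance and k = 3 are assumptions chosen for illustration; other distance measures (for example cosine distance on TF-IDF vectors) are equally possible.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    # Pair every training point's distance to x with its label, nearest first.
    neighbours = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

No model is fitted in advance: all work happens at prediction time, which is why the method makes no assumption about how the data are distributed.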
3.3.6 Ensemble Classifier
The purpose of ensemble techniques is to integrate the predictions of numerous base estimators built with a specific learning algorithm in order to increase the classifier’s accuracy and robustness. The idea behind the voting classifier is to combine conceptually distinct machine learning classifiers and predict the class labels using a majority vote (hard vote) or the average predicted probability (soft vote). Such a classifier can be effective for balancing out the individual weaknesses of a set of equally well-performing classifiers.
Fig. 2 shows the ensemble-based sentiment classification approach using supervised machine learning algorithms. The algorithms implemented in this research work are Support Vector Machine (SVM), Naïve Bayes (NB), k-Nearest Neighbour (KNN), Neural Network, Decision Tree (DT), Logistic Regression (LR), Stochastic Gradient Descent (SGD), and the proposed ensemble-based model.
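The two generic voting rules mentioned above can be sketched as follows. This illustrates plain hard and soft voting only; it is not the combination rule of the proposed ensemble model, whose details are not reproduced here. The tie-breaking rule (the earliest classifier wins) is an assumption.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over one predicted label per base classifier."""
    counts = Counter(predictions)
    top = max(counts.values())
    # Tie-break (assumption): take the label of the earliest classifier
    # among those that received the most votes.
    for label in predictions:
        if counts[label] == top:
            return label

def soft_vote(probabilities):
    """Average the class probabilities of the base classifiers.

    probabilities: list of dicts mapping label -> probability,
    one dict per base classifier, all with the same labels.
    """
    labels = probabilities[0].keys()
    n = len(probabilities)
    avg = {label: sum(p[label] for p in probabilities) / n for label in labels}
    return max(avg, key=avg.get)
```

The two rules can disagree: with probabilities `{1: 0.9, -1: 0.1}`, `{1: 0.4, -1: 0.6}`, and `{1: 0.4, -1: 0.6}`, a hard vote over the individual decisions returns −1, while the soft vote returns 1 because one classifier is far more confident than the other two.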
4 Performance Evaluation
4.1 Data Annotation: Inter-annotator Agreement Score
We employed the Fleiss’ kappa inter-annotator agreement score to evaluate the agreement between annotators in the manual data annotation. The Fleiss’ kappa score is calculated using the formula below (Wik21).
$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$
where the factor $1 - \bar{P}_e$ represents the degree of agreement that can be obtained above chance, and the degree of agreement that was actually achieved above chance is given by $\bar{P} - \bar{P}_e$. If the evaluators are totally in agreement, κ = 1, and κ = 0 if there is no agreement among the evaluators (other than what would be expected by chance). For the Marathi tweets dataset the inter-annotator agreement score is κ = 0.957, which is almost perfect agreement. Tab. 1 shows the inter-annotator agreement score and Tab. 2 shows the statistics for the Marathi tweets dataset after preprocessing and data annotation. These are shown graphically in Figs. 3 and 4, respectively.
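The computation of Fleiss' kappa can be sketched as follows. The input layout (one row per tweet, one column per category, each cell holding the number of annotators who chose that category) is an assumption of this sketch, and the example counts in the usage note are made up for illustration, not taken from the dataset.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings[i][j] = number of annotators who assigned item i to category j;
    every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Extent of agreement on each item.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    P_bar = sum(P_i) / n_items          # mean observed agreement
    P_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)
```

With three annotators and the categories ordered as (positive, neutral, negative), the table `[[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 0, 3]]` contains one tweet with a disagreement and gives a kappa of roughly 0.74.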
4.2 Performance of Sentiment Classification Approach
In the experiment we concentrated on a three-class problem: positivity, neutrality, and negativity. Using the Twitter API, we retrieved Marathi tweets. The Marathi tweets dataset is classified into three groups depending on the sentiment expressed in the statements. If the expressed attitude indicates positivity, the tweet is labelled 1; if it indicates neutrality, it is labelled 0; and if it indicates negativity, it is labelled −1.
The dataset is partitioned in a 75:25 ratio into training and testing datasets. The dataset is subjected to different preprocessing methods, including data cleaning, removal of URLs, hashtags, unnecessary blank spaces, emojis, and stop words, and lemmatization. k-fold cross-validation with k = 5 and k = 10 was also employed.
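The two evaluation protocols can be sketched with the standard library as follows. Shuffling with a fixed seed and the absence of stratification by class are assumptions of this sketch; the function names are illustrative.

```python
import random

def holdout_split(items, test_ratio=0.25, seed=0):
    """Shuffle and split the items into (train, test), 75:25 by default."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(items) * test_ratio)
    test = [items[i] for i in idx[:n_test]]
    train = [items[i] for i in idx[n_test:]]
    return train, test

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

In k-fold cross-validation every sample is used for testing exactly once, and the reported score is the average over the k folds.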
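The evaluation metrics used are accuracy and F-score, which are calculated as below.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, Precision = TP / (TP + FP), and Recall = TP / (TP + FN).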
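For the three-class setting, the two metrics can be computed as in the sketch below. How the per-class F-scores are averaged is not stated in this work, so the unweighted (macro) average used here is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f_score(y_true, y_pred, labels=(1, 0, -1)):
    """Unweighted mean of the per-class F-scores (assumed averaging)."""
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Each class is treated in turn as the "positive" class against the other two, so that precision and recall remain well defined for the labels 1, 0, and −1.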
We analysed comparative results from the base classifiers, the majority voting ensemble, and the developed ensemble classifier. The proposed ensemble classifier’s performance is compared to that of the individual conventional classifiers and the majority voting ensemble classifier. Tab. 3 displays the results. On the Marathi dataset, the proposed ensemble classifier outperformed the stand-alone classifiers and the majority voting ensemble classifier.
A classification model may be assessed using a variety of metrics, the most basic of which are accuracy and f-score. Tab. 3 shows the performance evaluation of the individual classifiers with k-fold validation. A graphical representation of this evaluation is shown in Figs. 5 and 6.
With 5-fold cross-validation (cv) on the dataset, the individual classifiers Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB), K-Nearest Neighbour (KNN), Neural Network (ANN), Decision Tree (DT), Logistic Regression (LR), and Stochastic Gradient Descent (SGD) obtained accuracies of 92.46%, 90.76%, 91.98%, 93.40%, 91.71%, 90.76%, and 95.47%, respectively, while the ensemble classifier performed better, with an accuracy of 96.77% and an f-score of 98.73%. With 10-fold cross-validation (cv), the individual classifiers SVM, MNB, KNN, ANN, DT, LR, and SGD obtained accuracies of 91.89%, 89.53%, 89.63%, 92.83%, 90.90%, 88.97%, and 96.13%, respectively, while the ensemble classifier again performed better, with an accuracy of 97.77% and an f-score of 97.89% on the Marathi tweets dataset.
4.3 Result Discussions
This is the first attempt to develop and evaluate a machine learning-based ensemble classifier for Marathi. Because there are no earlier results for the same language, we compared our model with work on Hindi and Konkani, since these languages have been considered for sentiment analysis using machine learning algorithms and are also written in the Devanagari script. For Hindi, the authors employed machine learning techniques such as Naive Bayes, Decision Tree, and Support Vector Machine (SMO) using the Weka tool and reached accuracies of 50.95%, 54.48%, and 51.07% on an electronics product review dataset. In the case of Konkani, the authors used a dataset of Konkani poetry with Naive Bayes classification and attained an accuracy of 82.67% [26–28]. In comparison, we obtained better classification results with the ensemble-based classifier: 96.77% and 97.77% for 5-fold and 10-fold cv, respectively.
5 Conclusion
This research work presents a benchmarked technique for sentiment analysis of the Asian language Marathi. For this purpose we created an annotated corpus of Marathi tweets and performed manual data annotation with the help of domain experts, with tweets labelled with polarity scores of 1, 0, and −1 for positivity, neutrality, and negativity. To evaluate the manually annotated corpus we used Fleiss’ kappa (the inter-annotator agreement score) and achieved an average kappa score of κ = 0.957, which is almost perfect agreement between annotators. In the ensemble-based sentiment classification experiments on the Marathi tweets dataset, the ensemble classifier performed better than the other machine learning classifiers, with an accuracy of 96.77% and an f-score of 98.73% under 5-fold cross-validation (cv), and an accuracy of 97.77% and an f-score of 97.89% under 10-fold cross-validation (cv).
Acknowledgement: The authors wish to express their thanks to one and all who supported them during this work.
Funding Statement: This paper was supported by Wonkwang University in 2022.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|