Ensemble stacked model for enhanced identification of sentiments from IMDB reviews

Azim, Komal; Tahir, Alishba; Shahroz, Mobeen; Karamti, Hanen; Vazquez, Annia Almeyda; Vistorte, Angel Rojas; Ashraf, Imran

doi:10.1038/s41598-025-97561-8

Published: 18 April 2025


Scientific Reports volume 15, Article number: 13405 (2025)


Abstract

The emergence of social media platforms has led to the widespread sharing of ideas, thoughts, events, and reviews. These shared views and comments carry people’s sentiments, and the analysis of these sentiments has become one of the most popular fields of study. Sentiment analysis in the Urdu language is as important a research problem as it is in other languages; however, it has not been investigated thoroughly. On social media platforms like X (Twitter), millions of native Urdu speakers use the Urdu script, which makes sentiment analysis in Urdu important. In this regard, an ensemble model, RRLS, is proposed that stacks random forest (RF), recurrent neural network (RNN), logistic regression (LR), and support vector machine (SVM). Urdu-language Internet Movie Database (IMDB) movie reviews and Urdu tweets are examined in this study. The Urduhack library was used to preprocess the Urdu data; its operations include normalizing individual characters, joining them, and correcting the spacing around punctuation. The normalization module addresses the problem of correctly encoding Urdu characters and replaces Arabic letters with their Urdu equivalents. Several models are adopted for an extensive evaluation of their accuracy on Urdu sentiment analysis, with performance measured using accuracy, precision, recall, and F-measure. Among the machine learning models, SVM and LR attained an accuracy of 87%. The long short-term memory (LSTM) and bidirectional LSTM (BiLSTM) models both reached 84% accuracy. The proposed ensemble RRLS model performs better than the other learning algorithms, achieving 90% accuracy and outperforming current methods. The synthetic minority oversampling technique (SMOTE) is observed to improve performance further, leading to 92.77% accuracy.


Introduction

People now share their views, opinions, and comments via social media platforms, which have become a widely used medium for sharing and receiving data, information, and ideas¹. Billions of users connect through these services, exchange opinions, and share ideas freely. While social media provides significant benefits, such as empowering marginalized voices to speak out and engage with civil society, it also has its downsides. For instance, while some individuals express their thoughts constructively, others misuse the platforms to spread harmful or abusive language². People can also use social media as a tool for self-education and empowerment toward a better quality of life and health³. Social media enables people to communicate in their native languages, producing vast amounts of content for academic analysis. While English dominates, low-resource languages like Arabic and Urdu are also commonly used on platforms like Twitter.

Figure 1 shows that Urdu is among the most widely spoken languages globally and is equally prominent on social media platforms. Sentiment analysis is vital for social media, blogs, forums, and online ads, but it faces challenges such as large lexicons, natural language processing (NLP) overhead, and fraudulent reviews. The diversity of languages, including French, Chinese, English, Urdu, and Arabic, adds to this complexity⁴.

Urdu is spoken by hundreds of millions of people worldwide, with over 169 million actively using it daily on social media and generating vast amounts of Urdu language data. However, very limited research and resources are available for examining user sentiment in Urdu⁵. Sentiment analysis in Urdu is conducted using machine learning (ML) and deep learning (DL) techniques to understand people’s thoughts by analyzing subjective data. Effective Urdu sentiment analysis requires advanced preprocessing, innovative ML techniques, and sentiment lexicons to benefit Urdu-speaking industries⁶. The results of this research can be applied across various sectors. The Urdu language requires more attention and exploration from researchers, especially when compared to other languages worldwide⁷. One major problem is the lack of structured Urdu data that can be used with machine learning models, so compiling a dataset of Urdu-language tweets is a significant challenge.

Many studies have used ML and DL models for tasks related to the Urdu language. For example, Rafique et al.⁸ detect fabricated news in Urdu, while⁹ performs cross-domain sentiment analysis for the Urdu language. An ML approach is used in Mehmood et al.¹⁰ for detecting threatening language in tweets. The study¹¹ makes use of recurrent neural networks (RNN) for Urdu lemmatization, while¹² presents an approach to rectify spelling errors in the Urdu language. ML and DL models have been adopted in various domains for automated tasks, particularly for a variety of NLP applications. Despite existing work on the Urdu language, Urdu sentiment analysis is not very well studied. This study adopts an ML approach for Urdu sentiment analysis due to its effectiveness and efficiency.

The novelty of the proposed approach lies in the use of a stacked ensemble method, where multiple ML models, namely random forest (RF), logistic regression (LR), and support vector machine (SVM), serve as base learners, and an RNN functions as the meta-learner. The class predictions from the base learners are fed into the RNN, which then refines the output to produce more accurate sentiment predictions. This stacked ensemble leverages the strengths of both machine learning and deep learning techniques to enhance sentiment classification accuracy. A key aspect of the proposed model is its ability to handle smaller datasets, particularly in the context of Urdu tweets and movie reviews, where traditional machine learning and deep learning models often perform suboptimally. By combining multiple learning paradigms and incorporating the term frequency-inverse document frequency (TF-IDF) technique for feature extraction, the model can improve performance even in low-resource settings. TF-IDF helps identify the most informative words in the dataset, further enhancing the model’s ability to differentiate between positive, negative, and neutral sentiments. The main contributions are as follows:

A hybrid technique is proposed by using ML and DL algorithms to improve sentiment analysis results. The proposed model RRLS utilizes random forest (RF), RNN, logistic regression (LR), and support vector machines (SVM) via stacking.
Two datasets are used in this research for model evaluation: Urdu movie reviews from the Internet Movie Database (IMDB) and Urdu tweets, categorized into three sentiments: positive, negative, and neutral. Analysis, preprocessing, and feature engineering of the Urdu text data have been conducted using the Urduhack library.
Decision trees (DT), SVM, RNN, long short-term memory (LSTM), and bidirectional LSTM (BiLSTM) have been adopted in this research to conduct experiments using Urdu textual data. To examine the proposed approach, accuracy, precision, recall, and F1 score are utilized as metrics to evaluate performance.
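
The stacking idea behind RRLS can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn is available, uses a handful of made-up English placeholder sentences rather than Urdu data, and swaps in a small feed-forward `MLPClassifier` as the meta-learner because scikit-learn does not provide a recurrent network. All hyperparameters are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus; labels: 0 = negative, 1 = neutral, 2 = positive.
texts = [
    "terrible boring film", "awful weak plot", "bad acting bad story", "worst movie ever",
    "the film was released today", "movie runs two hours", "it is a film", "story set in a city",
    "great wonderful film", "excellent strong plot", "good acting good story", "best movie ever",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Base learners named in the text: RF, LR, and SVM.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(kernel="linear")),
]

# Stand-in meta-learner (assumption): a small feed-forward network, not an RNN.
meta_learner = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)

# stack_method="predict" passes the base learners' class predictions
# (obtained out-of-fold via cross-validation) to the meta-learner.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=meta_learner,
    stack_method="predict",
    cv=2,
)

# TF-IDF features feed the whole stacked ensemble.
model = make_pipeline(TfidfVectorizer(), stack)
model.fit(texts, labels)
preds = model.predict(texts)
print(preds)
```

Training the meta-learner on out-of-fold predictions, as `StackingClassifier` does, keeps it from simply memorizing base learners that have already seen the same training rows.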

The rest of this paper is organized as follows. Section 2 reviews the strengths and weaknesses of existing studies. Section 3 describes the methodology framework and datasets. The experimental findings are then presented and examined in Section 4, where the performance of different models on the Urdu sentiment datasets is evaluated. Section 5 draws conclusions based on the results and suggests potential directions for future research in Urdu sentiment analysis.

Systematic review

Sentiment analysis with ML and artificial intelligence (AI) has been proposed for various applications, such as analyzing industrial behavior on social media platforms. Several studies have utilized social media posts for this purpose. Figure 2 shows an analysis of existing literature on Urdu sentiment analysis.

Nazir et al.¹³ studied hotel reviews written in Roman Urdu, employing LR and SVM, which yielded accuracies of 85.30% and 80.00%, respectively. Polarity-based sentiment analysis of Roman Urdu was also conducted using various language models and nine ML algorithms, achieving 92.25% accuracy with the LR model, while the k-nearest neighbor (KNN) model obtained 91.47%.

In another study¹⁴, after noise reduction techniques were applied to social media data, decision trees (DT) were used for classification and vectorization, resulting in 96.00% accuracy on the training dataset. Using a hybrid ML approach, the study¹⁵ performed Urdu sentiment analysis of social media interactions. The SVM model showed an accuracy of 74.69%, with precision, recall, and F1 scores of 74.00%, 73.00%, and 74.00%, respectively. Urdu sentiment analysis was also conducted from a multilingual perspective, incorporating Urdu, Roman Urdu, and a combination of both, using ML models such as LR, DT, and RF; the RF model achieved an accuracy of 74.00%¹⁶.
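
Precision, recall, and F1 scores like those quoted above are usually computed per class and then averaged. The following is a minimal sketch of that computation, assuming macro averaging (the averaging scheme is an assumption here, not something the cited studies are known to use); the label lists are invented for illustration.

```python
def per_class_counts(y_true, y_pred, label):
    """Count true positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def macro_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical three-class example (positive, negative, neutral).
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
scores = macro_scores(y_true, y_pred)
print(scores)
```

Macro averaging weights every class equally, so it can differ noticeably from accuracy when the classes are imbalanced.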

The LR and SVM models were applied to classify Roman Urdu reviews from the Daraz online shopping website, achieving 75.00% accuracy¹⁷ by exploring improved feature extraction techniques. Lexicon-based models were applied to Urdu-Arabic script to analyze sarcasm, achieving 48.50% accuracy on sarcastic tweets and 23.50% on non-sarcastic tweets, with a precision of 87.90% and a recall of 69.60%. An NB-based model identified 8.30% of sarcastic tweets, with a recall of 20.10% and a precision of 82.80%, and obtained 56.9% accuracy on non-sarcastic tweets. These results demonstrate the ongoing attempts to enhance classification techniques in Urdu sentiment research. Talat et al.³.

Haroon et al.¹⁸ used a movie reviews dataset and extracted features with term frequency-inverse document frequency (TF-IDF) and bag of words (BoW) techniques. Sentiment analysis was performed with a convolutional neural network (CNN), LSTM, RNN, SVM, and NB, and the models were evaluated using several metrics. The ML models showed accuracy ranging from 81.00% to 90.00%, while the DL models obtained 84.00% to 94.00% accuracy¹⁸.
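
To make the difference between BoW and TF-IDF features concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the textbook weighting tf × log(N / df); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The three documents are invented placeholders.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw count vector per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors


def tf_idf(docs):
    """Weight each count by term frequency times inverse document frequency."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for vec in bow if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for vec in bow:
        total = sum(vec)
        weighted.append(
            [((c / total) * w) if total else 0.0 for c, w in zip(vec, idf)]
        )
    return vocab, weighted


docs = ["good movie", "bad movie", "good good plot"]  # hypothetical toy corpus
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)
print(bow)
print(weights)
```

BoW keeps raw counts, so a frequent but uninformative word dominates; TF-IDF scales down terms that occur in many documents, and under this unsmoothed formula a term present in every document receives a weight of zero.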

The lexicon-based technique has also been employed for Urdu sentiment analysis. Using Urdu text analysis steps, an accuracy of 64.00% is reported in Rehman and Bajwa¹⁹. In another work, twenty thousand Roman Urdu sentences in the RU-EN-Emotion corpus were annotated with emotional content and classified as either emotion sentences or neutral sentences. The efficacy of six conventional ML and DL methods was then evaluated. CNN, when paired with GloVe embeddings, proved to be the best strategy, and the new RU-EN-Emotion corpus offers greater utility than the existing corpus.

In the study²⁰ on YouTube comments, six machine learning algorithms were used: NB, SVM, LR, DT, KNN, and RF. The SVM, LR, and RF models attained the highest accuracy. Another study focused on the multi-label classification of toxic comments in Urdu, employing algorithms such as binary relevance (BR) and bagging. With n-gram features and TF-IDF weighting, BR achieved a score of 96.6%²¹. In another work, CNN achieved the best accuracy despite two notable limitations: first, CNN-based models require larger datasets for training; second, a CNN assumes that each word influences a statement’s polarity in the same manner. The authors therefore suggested a CNN model with an attention module and utilized transfer learning to improve sentiment analysis²². Roman Urdu is considered in²³ for sentiment analysis of comments on Pakistan Super League (PSL) anthems, categorizing them as positive, negative, or neutral using machine learning algorithms such as NB, KNN, ANN, and LR. Experimental results show a highest accuracy of 97.00%.

Another study²⁴ used ML models for sentiment analysis, where SVM emerged as the most effective model. In another work²⁵, a dataset of more than 100,000 examples covering twelve distinct topic types is classified using random forest, a well-known ML classifier. For unigram, bigram, and trigram features, it achieved accuracy ranging from 64.41% to 80.15%, with bigrams reaching 76.88%. In Bangash et al.²⁶, sentiment analysis employing a lexicon-based method and boolean data analysis revealed a positive relationship between a political party’s electoral success and the number of positive tweets it received. Research utilizing word embedding techniques shows a notable improvement when using the Hugging Face transformer models DistilBERT and XLNet, in contrast to LR and NB, two popular machine learning models²⁷.
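
The unigram, bigram, and trigram features mentioned above are contiguous word sequences of length one, two, and three. A minimal sketch of extracting them follows, assuming simple whitespace tokenization; the Roman Urdu sentence is an invented example, not taken from any of the cited datasets.

```python
def word_ngrams(text, n):
    """Return all contiguous word n-grams of length n from a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "yeh film bohat achi hai"  # hypothetical sample: "this film is very good"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
print(unigrams)
print(bigrams)
print(trigrams)
```

Longer n-grams capture more context, such as "bohat achi", but each one occurs less often, which is one plausible reason accuracy varies across n-gram sizes.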

A multiclass sentiment analysis of the general public’s opinion of police authority and the public services provided is conducted in both Urdu and English, the regional languages²⁸, covering positive, negative, and neutral attitudes. The SVM provides optimal performance for the multi-class problem, with an accuracy of 86.87%. The study²⁹ worked on a massive corpus of tweets using a pre-processing pipeline. It involves removing columns that contain user information, retweet counts, and follower data, as well as redundant tweets, links, excess punctuation, spaces, and symbols; identifying whether tweets contain emojis; and then extracting the relevant details.
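
A tweet-cleaning pipeline of the kind described above might look roughly like the following sketch. It uses only Python's standard `re` module; the regular expressions, the choice to strip emojis after flagging them, and the function names `clean_tweet` and `clean_corpus` are all assumptions made for illustration and are not taken from the cited study.

```python
import re

# Illustrative patterns (assumptions, not the cited study's rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage: main pictograph blocks plus miscellaneous symbols.
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Runs of the same mark, including the Urdu question mark, full stop, and comma.
REPEATED_PUNCT_RE = re.compile(r"([!?.,\u061F\u06D4\u060C])\1+")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip links, emojis, repeated punctuation, and extra spaces from one tweet."""
    text = URL_RE.sub(" ", text)
    has_emoji = bool(EMOJI_RE.search(text))  # record emoji presence before stripping
    text = EMOJI_RE.sub(" ", text)
    text = REPEATED_PUNCT_RE.sub(r"\1", text)
    text = SPACE_RE.sub(" ", text).strip()
    return text, has_emoji


def clean_corpus(tweets):
    """Clean every tweet and drop empty or redundant (duplicate) results."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text, has_emoji = clean_tweet(tweet)
        if text and text not in seen:
            seen.add(text)
            cleaned.append({"text": text, "has_emoji": has_emoji})
    return cleaned


raw = ["great film!!!  \U0001F600 https://t.co/abc", "great film!", "   "]
result = clean_corpus(raw)
print(result)
```

Dropping metadata columns such as user information or retweet counts would happen earlier, at the table level, and is omitted here.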

Table 1 summarizes the performance and methodologies of several leading approaches, including traditional ML classifiers and advanced feature engineering techniques. By comparing techniques such as multinomial NB (MNB), Bernoulli NB (BNB), SVM, DT, RF, and LR, it illustrates the relative strengths and weaknesses of each approach for sentiment analysis, particularly for Urdu text. This comparison helps position our research within the broader landscape of sentiment analysis technologies.

Table 1 Comparing various state-of-the-art research works on sentiment analysis.
