Introduction

Short Message Service (SMS) is one of the most prevalent modes of communication in today’s interconnected society. Conceived in 1984 by German engineer Friedhelm Hillebrand and his colleague Bernard Ghillebaert as a method for transmitting messages over the telephone network using GSM standards, SMS came to fruition in the 1990s1. SMS is distinct from Multimedia Messaging Service (MMS) due to its character limitations and its incapacity to transmit videos, audio, and images over traditional cellular networks1,2,3.

While SMS offers numerous societal benefits, such as facilitating communication, it also creates opportunities for cybercriminals to innovate their deceptive tactics. One such tactic is the dissemination of spam messages, which are unsolicited electronic communications4,5. The appeal of SMS for spam lies in the availability of unlimited pre-paid SMS packages in countries like India, Pakistan, China, and increasingly, the United States. Additionally, SMS spam often achieves higher response rates compared to email spam, as SMS is a trusted service that users frequently rely on for confidential exchanges. Consequently, SMS spam has emerged as a significant issue, imposing substantial costs related to lost productivity, network bandwidth consumption, management overhead, and compromised personal privacy6.

The incidence of SMS spam remains persistently high, particularly within the United States. One report7 observed a notable uptick in spam text activity, with a 58.0% increase reported in 2022, and noted that 1 in every 3 Americans encountered fraudulent schemes through spam texts, with 65% of these individuals only realizing they had been deceived after the fact. These findings are corroborated by8, who documented a staggering 157.0% surge in spam texts among Americans, amounting to 225 billion such messages in 2022 alone. Furthermore, Orred’s research revealed that individuals between the ages of 18 and 44 are particularly vulnerable to financial losses from phone scams, with 55.6% being male, 42.2% female, and 2.3% identifying as non-binary. The severity of the issue is graphically depicted in Fig. 1, as presented by9.

Fig. 1

The growth of spam texts among Americans, as presented by9.

The detection of SMS spam messages offers a myriad of advantages, chief among them being the mitigation of financial losses and the restoration of trust in communication channels within society. SMS spam typically encompasses fraudulent schemes devised to deceive recipients into making payments or providing services, thereby compromising the integrity of the communication medium. Over time, such deceptive practices erode the credibility and reliability of SMS as a communication channel, resulting in decreased engagement with vital messages10. Furthermore, the identification of SMS spam messages enhances user privacy and safeguards sensitive information, thereby improving the overall user experience and mitigating the intrusion of unsolicited content11,12.

Previous studies on SMS spam detection have employed a variety of algorithms, each demonstrating notable performance. For instance, the authors of13 tested Support Vector Machine models, achieving an accuracy of 97.8%, while the authors of5 fine-tuned the hyperparameters of a Convolutional Neural Network, achieving an accuracy of 99.44%. The authors of14, on the other hand, utilized multiple models with various word embedding techniques, finding that LSTM achieved the highest accuracy at 98.5%. Furthermore, the authors of15 experimented with a modified Transformer model for SMS spam detection, reaching an accuracy of 98.9% through hyperparameter tuning.

While these studies highlight the potential of advanced machine learning algorithms in detecting SMS spam, the effectiveness of these models heavily depends on the quality of the datasets used for training and evaluation. SMS spam messages pose significant challenges globally, with billions of fraudulent messages sent annually, leading to financial losses and privacy concerns. To combat this issue, researchers rely on publicly available datasets to train machine learning models for spam detection. However, while these datasets are widely used, their quality and suitability for robust spam detection remain underexplored. The performance of spam detection models is highly dependent on the characteristics of the datasets used, such as class balance, noise, and feature diversity. Yet, to the best of the authors’ knowledge, no comprehensive study has evaluated the quality of SMS spam detection datasets to understand their impact on model performance.

This study aims to fill this gap by providing insights and recommendations on ten publicly available SMS spam detection datasets. Rather than focusing solely on classifier accuracy, this research emphasizes understanding dataset characteristics and their influence on spam detection performance. Specifically, this study evaluates the datasets based on quantitative and qualitative metrics, analyzes the performance of Decision Tree and Multinomial Naïve Bayes models, and recommends datasets for future research based on their challenge level and quality. It is hypothesized that the characteristics of a dataset have a significant impact on model performance, with Multinomial Naïve Bayes expected to outperform Decision Tree due to its robustness in handling high-dimensional text data. Additionally, it is anticipated that the most challenging dataset, which will be the most strongly recommended, will provide a valuable testbed for improving model adaptability and robustness, offering insights into dataset selection for future SMS spam detection research. To qualify for recommendation, a dataset must present a significant challenge for models to achieve high accuracy compared to other datasets and must attain the highest average Likert score in the qualitative assessments, ensuring the quality and credibility of the recommendation.

By offering a structured framework for dataset evaluation, this study contributes to the development of more robust machine learning models and informs future research on spam detection systems. The primary contributions of this work include:

1. A comparative analysis of Decision Tree and Multinomial Naïve Bayes model performance across all datasets.

2. An evaluation of the quality of ten publicly available SMS spam detection datasets using quantitative and qualitative metrics.

3. Dataset recommendation for advancing spam detection research using datasets with varying complexity and characteristics.

Literature review

Previous work on SMS spam detection

Across previous studies on SMS spam detection, a dataset named SMS Spam Collection v.1 has been utilized by15,16,17,18. This dataset comprises 5,574 text messages, with 4,827 classified as ham (legitimate) and 747 classified as spam.

Utilizing the SMS Spam Collection v.1 and UtkMl’s Twitter Spam Detection Competition dataset19, the authors of15 introduced a modified Transformer model for detecting SMS spam messages. This model was compared against traditional machine learning algorithms and deep learning algorithms, such as LSTM (Long Short-Term Memory) and CNN-LSTM (Convolutional Neural Network - Long Short-Term Memory). The authors found that their proposed model outperformed all other compared models across all tested metrics. Specifically, it achieved an accuracy of 98.92% on the SMS Spam Collection v.1 dataset and 87.06% on UtkMl’s Twitter dataset.

The authors of16 concentrated on enhancing SMS spam detection through the introduction of new content-based features. Their aim was to bolster the performance of spam detection methods by integrating semantic categories of words as features and reducing the feature space. Utilizing a dataset similar to those used in previous studies, namely the SMS Spam Collection v.1, the authors employed a variety of models, including Naive Bayes (NB), k-Nearest Neighbors (kNN), kNN45, Support Vector Machine (SVM), Information Theoretic Co-Training, and Boosted-Random Forest. The findings indicate that the Boosted-Random Forest algorithm attained the highest accuracy of 98.47% and the highest Matthews Correlation Coefficient of 0.934 among all algorithms evaluated. Furthermore, the Boosted-Random Forest exhibited the highest sensitivity at 89.1% and the lowest Balanced Hit Rate at 0.1%. Overall, the Boosted-Random Forest algorithm surpassed other models in terms of accuracy, Matthews Correlation Coefficient, and sensitivity, establishing itself as the most effective model for SMS spam detection in that study.

The authors of17 employed this dataset along with SMS Spam Corpus v.0.1 Big to enhance SMS spam detection on mobile phones by integrating FP-Growth for frequent pattern mining and a Naive Bayes Classifier for classification. They found that the SMS Spam Collection v.1 dataset alone yielded the highest accuracy of 98.51% with a 9% minimum support, while the SMS Spam Corpus v.0.1 Big dataset’s accuracy improved by 1.15% when utilizing FP-Growth. Moreover, combining both datasets resulted in an accuracy of 98.47% with a 6% minimum support. The authors concluded that the quantity and quality of the dataset significantly influence the accuracy of the models, noting an inverse relationship between dataset quantity and the minimum support parameter. Specifically, as the dataset size increases, the required minimum support parameter for optimal performance decreases. Conversely, smaller datasets require higher minimum support values to avoid generating excessive or irrelevant features, which could negatively impact the classification model’s accuracy.

In an effort to detect SMS spam messages using the H2O framework, the authors of18 employed the SMS Spam Collection v.1 dataset to evaluate the performance of various machine learning algorithms, including Random Forest, Deep Learning, and Naïve Bayes, for SMS classification. Emphasizing the significance of features such as the number of digits and the presence of URLs in accurately identifying SMS spam messages, the authors found that Random Forest recorded the highest precision, recall, F-measure, and accuracy scores compared to the other models, albeit with the slowest runtime. Conversely, while Naïve Bayes achieved lower scores in precision, recall, F-measure, and accuracy, it excelled in terms of runtime efficiency.

Previous work on data curation and data quality assessment

The author of20 defines data curation as the process of organizing and managing large volumes of data to streamline the annotation process. Data curation determines the starting and labelling points to ensure efficient resource use when handling extensive datasets. The author emphasizes that data curation is crucial for businesses aiming to optimize their data processes, save significant time for machine learning engineers, focus more on model development, and facilitate the integration of models into business workflows. Essentially, data curation involves managing, annotating, and organizing data to ensure it is of the highest quality, accessible, and usable, which is essential for achieving optimal model performance20.

Before delving into previous practice in data curation, it is essential to understand the issues related to data quality. The authors of21 have identified a range of issues that can compromise dataset quality, thereby affecting the integrity and effectiveness of machine learning models. In addition to common data quality issues such as spelling errors, duplicate records, conflicting fields, and inconsistencies, the authors also highlighted other problems including insufficient metadata, labelling errors, pre-processing challenges, dataset biases, and low annotation quality from crowdsourced data, a claim that is supported by22.

In light of these data quality issues, the authors of21 reported the risks associated with the aforementioned dataset quality problems. Poor data quality can lead to a decline in model performance, evidenced by decreased metrics such as accuracy, precision, and recall. Additionally, model reliability and stability are compromised, as poor-quality datasets can render models unreliable or unstable, diminishing their practical utility21. Furthermore, these issues can result in incorrect or misleading conclusions, posing risks and potentially causing losses in business decisions. They also present security threats, such as privacy breaches and susceptibility to malicious attacks21.

To mitigate these risks, various data curation criteria and quality assessments have been proposed. The authors of22, for example, outlined a four-step process: assessing datasets, identifying data quality issues, evaluating metadata, and preparing a metadata report. Although the data evaluation framework proposed by23 is aimed at Intrusion Detection System (IDS) datasets, it also validates the data quality assessment framework proposed by22. For instance, the authors of23 evaluated IDS datasets based on their completeness in network configuration and traffic to accurately represent real-world scenarios and simulate genuine attack behavior. The authors also assessed dataset reliability by examining the accuracy of tagging and labelling. Furthermore, they emphasized the importance of comprehensive data documentation, or metadata, noting that insufficient metadata reduces a dataset’s usability for other researchers.

Additionally, the authors of23 introduced several unique data evaluation criteria, such as anonymity and heterogeneity. They highlighted the importance of privacy protection in datasets, as well as the need to balance privacy concerns against the dataset’s usefulness. Moreover, they also pointed out that incorporating heterogeneous data sources can enhance detection capabilities, making the dataset more valuable to the research community.

Looking at higher-dimensional data evaluation metrics, the authors of21 utilized the concept of the dataset lifecycle, which comprises several processes from data collection to data destruction, as illustrated in Fig. 2. As visualized in the figure, each of these processes has its own quality evaluation metrics, which contribute to the overall quality of the data.

Fig. 2

Data Lifecycle, as proposed by21.

The overall work of21 is particularly beneficial for individuals interested in generating their own datasets. For those who, as in the current research, focus on collecting and utilizing publicly available datasets, only the data testing portion of21 is pertinent, and it is this portion that this research employs. Additionally, the metrics involved in data collection and data annotation can optionally be used to enhance the interpretability of the results and to ensure that the outcomes are more reliable and accurate.

Methodology

Methodology description

In this research, a comparative analysis-based approach is utilised to evaluate the quality of ten publicly available SMS spam detection datasets. This methodology, presented in Fig. 3, closely replicates the methodology used by24. It consists of several phases: Problem Understanding, Data Collection, Data Understanding, Data Preparation, Modelling, and Evaluation.

Fig. 3

The methodology used in this research, as replicated from24.

The first phase involved identifying the problem and the objectives of the research. This included choosing a topic of interest and pinpointing relevant issues that require further study in the area of SMS spam detection. Additionally, research objectives were formulated to define the scope of the research. A thorough literature review of SMS spam detection was conducted, with particular reference to the work of25, which was instrumental in studying the standard structure of SMS spam datasets and the spam detection code used. Furthermore, previous data curation practices from the past few years have also been reviewed to provide more insight into the process implemented in this research.

The second phase was data collection, which involved sourcing previously published research articles in the area of SMS spam detection that included the authors’ data and spam detection code. This step was crucial as it provided an opportunity to study multiple views of the structure of SMS spam datasets and the associated detection code. Following this, additional SMS spam datasets were collected from various platforms, including GitHub, Kaggle, and Google Dataset Search. Collecting datasets from these diverse and reputable sources was essential to ensure a comprehensive and robust collection of datasets for this research. The dataset selection process was not restricted to English-language datasets: any dataset was considered unless it lacked essential components, such as raw text messages or labels (spam/ham), rendering it unsuitable for analysis.

The third phase involved understanding the collected data by summarizing the characteristics and attributes of each dataset. Akin to the dataset testing phase proposed by21, this phase included identifying the number of rows and columns and the language used, as well as the similarities and differences between each dataset. This phase concurrently involved identifying the quality criteria of the dataset. As discussed in more detail in a later section, the criteria consist of the authenticity of the source, class imbalance, the diversity of features, the availability of metadata, and the data preprocessing requirements. This led to an understanding of the quality level of the data, which helped identify the subsequent steps needed for analysis. This step was crucial for the qualitative section of the discussion, where the correlation between the structure of the dataset, model performance, and data reusability is discussed. Additionally, this phase involved understanding the spam detection code used by25 for the subsequent phase. The choice of classification models by25 led to the formulation of a hypothesis that serves as an expectation of the outcome of the research.

The next phase was data preparation, a crucial step in readying the data for modelling, which involved data preprocessing. Each dataset was standardized to only two columns: v1 (category of message) and v2 (text message). This aligns with the authors’ spam detection code, which only accepts these two columns. Additionally, data integration was applied, since some datasets contain data spread across separate Excel files.
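As an illustration of this step, a minimal pandas sketch is given below; the file names and original column names are hypothetical, since they differ across the collected datasets.

```python
import pandas as pd

# Merge a dataset that was distributed as several separate spreadsheet files
# (the file names below are placeholders, not the actual dataset files).
parts = [pd.read_excel(f"dataset_part_{i}.xlsx") for i in range(1, 4)]
df = pd.concat(parts, ignore_index=True)

# Keep only the label and message columns and rename them to the v1/v2
# convention expected by the spam detection code.
df = df[["label", "message"]].rename(columns={"label": "v1", "message": "v2"})
df.to_csv("dataset_standardized.csv", index=False)
```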

The fifth phase was to execute the spam detection code by25 on the previously prepared data. The current research uses Google Colaboratory as the integrated development environment for executing the code, with Google Drive mounted to the Google Colaboratory notebook to make the datasets available to the spam detection code. The code uses two classification models: Decision Tree and Multinomial Naïve Bayes. Additionally, the code splits the data into training and testing sets in a ratio of 80:20.
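A sketch of this execution phase (and of the evaluation outputs documented in the next phase) is shown below; it is not the authors’ original code from25, and the Drive path is an assumption.

```python
from google.colab import drive
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

drive.mount("/content/drive")  # expose the prepared CSV to the notebook
df = pd.read_csv("/content/drive/MyDrive/dataset_standardized.csv")

# Bag-of-words representation and the 80:20 train/test split described above.
X = CountVectorizer().fit_transform(df["v2"])
y = df["v1"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit both classifiers and document their evaluation metrics and confusion matrices.
for model in (DecisionTreeClassifier(random_state=42), MultinomialNB()):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(type(model).__name__)
    print(classification_report(y_test, preds))
    print(confusion_matrix(y_test, preds))
```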

The final phase was analysing the experimental results. The evaluation metrics and confusion matrices generated by both classification models were documented and studied to understand their performance. The process was then iterated over the different datasets, and the most accurate and reliable model was documented for each dataset. Additionally, the experiment was repeated after modifying the code to remove the stopwords of the respective non-English languages, and the results were recorded, compared, and justified, allowing for an unbiased recommendation of datasets.

Dataset description

The current study utilizes 10 publicly available SMS spam detection datasets, obtained from relevant journal articles and dataset repositories. The datasets cover several languages, including transliterations of languages other than English. Each dataset exhibits a different level of class imbalance: datasets 1, 3, and 9 display highly imbalanced class distributions, datasets 6 and 8 have a perfectly balanced class distribution, and the remaining datasets are moderately imbalanced. In terms of source, most datasets (2, 5, 6, and 9) were collected from GitHub, while datasets 3, 8, and 10 were retrieved from Kaggle. Dataset 1 and Dataset 7 were sourced from Archive and Zenodo, respectively. It is important to note that, since Dataset 7 is non-proprietary, collecting it entails contacting the dataset’s authors through Zenodo before they grant access to the download link. Unfortunately, the link to the source of Dataset 4 and its associated metadata is unavailable; the dataset was retrieved for this research on February 20, 2024. Table 1 provides additional information on each dataset, and Fig. 4 visualizes the class distribution of each dataset. As seen in Table 1, the datasets are either in English (Dataset 1 and Dataset 5), monolingually non-English (Dataset 2, Dataset 4, Dataset 7, and Dataset 8), transliterated non-English (Dataset 9), monolingually non-English combined with English (Dataset 3), or transliterated non-English combined with English (Dataset 6 and Dataset 10). Transliterated non-English datasets differ from direct translations of non-English datasets in the sense that the text structure is generally preserved in the transliterated dataset rather than being transformed entirely.

Table 1 Overview of the datasets, including the class distribution, Gini coefficient, language used, and source.
Fig. 4

Class Distribution for each dataset.
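For reference, the Gini coefficient reported in Table 1 can be recomputed from the class counts. The sketch below assumes the standard two-class Gini impurity (values near 0.5 indicate a balanced spam/ham split, consistent with the 0.4999 figures discussed later); the authors’ exact formula may differ.

```python
def gini(counts):
    """Gini impurity of a class distribution, e.g. gini([4827, 747])."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Example with the SMS Spam Collection v.1 counts cited in the literature review.
print(round(gini([4827, 747]), 4))  # ~0.2321, a clearly imbalanced dataset
print(round(gini([500, 500]), 4))   # 0.5, a perfectly balanced dataset
```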

Dataset issues

These datasets contain numerous issues. Aside from duplicated and missing values, certain datasets have spelling errors in the labelling column (ham or spam). Another problem in this column is the presence of white spaces after the label, which is difficult to detect when filtering the labels in Microsoft Excel. In addition, some of the data are made available in separate CSV files, which requires integrating them into a single file. Some datasets either have no header or use headers with different terms. Furthermore, certain datasets are in languages other than English, which introduces the problem of inconsistent data, especially since these datasets use different categorical terms, such as ‘Normal’ and ‘Spam’, instead of ham or spam. Moreover, certain datasets lack sufficient background information or have no metadata at all, posing a risk of providing limited contextual information to researchers who intend to utilize them.

Risk associated with dataset issues

The aforementioned dataset issues pose significant risks to this research, particularly in terms of model performance and result reliability. Poor data quality can severely impact the effectiveness of SMS spam detection models, leading to inaccuracies, inconsistencies, and biased predictions. According to the Garbage-In-Garbage-Out (GIGO) principle, flawed input data inevitably produces flawed outputs, making it critical to address these data quality concerns.

Several common data quality problems influence model performance, including spelling errors, duplicate records, field conflicts, and inconsistencies. Spelling errors in spam messages can lead to misclassification, particularly when models rely on text-based features. Duplicate records can skew dataset distributions, resulting in biased model training and overfitting. Inconsistencies in labelling, such as differences in the spelling of “Spam” and “SPAM” (also known as field conflicts), can confuse classification models, reducing their ability to learn correct patterns and producing unwanted (non-binary) results.

Other critical data quality issues that degrade model performance include insufficient metadata, labeling errors, preprocessing challenges, dataset bias, and low-quality crowdsourced data. Insufficient metadata reduces dataset interpretability, making it difficult for researchers to understand the context and characteristics of the data, which in turn affects the interpretation of the model’s output. Additionally, the lack of metadata reduces the datasets’ reusability, as it impedes researchers’ ability to fully comprehend and effectively utilize the data22. Labeling errors can lead to incorrect training signals, weakening model reliability and reducing classification accuracy. Preprocessing challenges, such as missing values and inconsistent column structures, increase the effort required to prepare the datasets and may introduce additional errors. Dataset bias, particularly from unbalanced class distributions, results in models that favor the dominant class, leading to poor generalization. Additionally, low-quality crowdsourced data often suffer from unreliable annotations and inconsistencies, which further deteriorate model performance and reduce the reusability of the dataset.

Moreover, flawed datasets can cause execution failures in machine learning pipelines, or even when successfully executed, produce misleading conclusions. When datasets contain errors or biases, model predictions may not reflect real-world SMS spam detection challenges, leading to false confidence in model performance. This discrepancy can result in ineffective spam filters, which may allow harmful messages to bypass detection or incorrectly classify legitimate messages as spam. Given these risks, the assurance of high-quality datasets is crucial for developing reliable and robust SMS spam detection models.

Preprocessing steps applied

To preprocess the datasets, all columns were removed except for two: labels and text/messages. These columns were then renamed to v1 and v2, respectively. In some datasets, all features were consolidated into a single cell. An example can be observed in Dataset 2, where the text message, the numerical category of the message (1 represents spam and 2 represents ham), and the textual category of the message (spam or ham) were all stored in a single cell, separated by semicolons. To address this, the Text-to-Column feature in Microsoft Excel was used to split the single column into multiple columns.
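The same split can be reproduced programmatically. A minimal pandas equivalent of the Text-to-Column step, assuming the semicolon-delimited layout described for Dataset 2 (the field order and file name are assumptions):

```python
import pandas as pd

# Split the semicolon-delimited fields into separate columns,
# mirroring Excel's Text-to-Column feature.
raw = pd.read_csv("dataset2_raw.csv", sep=";", header=None,
                  names=["v2", "numeric_label", "v1"], engine="python")

df = raw[["v1", "v2"]]  # keep the textual label and the message text
```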

Additionally, data integration was performed to handle the issue of separate data files, resulting in the consolidation of all data files into a single CSV file, a process particularly relevant for datasets 3 and 5. Rows with missing values were identified and removed to ensure data integrity. Trailing white spaces in the labels were trimmed to ensure correct binary output for the models. Furthermore, any rows in the label column containing unintended data, such as labels other than ‘spam’ or ‘ham’, were detected and removed. However, given the objective of this research, no steps were taken to address the issue of missing or insufficient metadata.
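The cleaning steps above translate directly into a few pandas calls; a minimal sketch under the same v1/v2 convention (the ‘Normal’ label mapping is illustrative, based on the terms mentioned in the dataset issues above):

```python
# Drop rows with missing values, trim stray whitespace around labels,
# harmonize alternative label terms, and discard any unexpected labels.
df = df.dropna(subset=["v1", "v2"])
df["v1"] = df["v1"].str.strip().str.lower().replace({"normal": "ham"})
df = df[df["v1"].isin(["ham", "spam"])]
```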

No synthetic data sampling techniques were applied to mitigate class imbalance. However, each dataset underwent multiple preprocessing steps, including label encoding to convert categorical labels into numeric values, feature engineering by adding a character-length column, and extensive text cleaning. Additionally, to reduce noise in the datasets, text preprocessing techniques were applied, involving regular-expression-based replacements for email addresses, URLs, currency symbols, phone numbers, and numerical values, followed by the removal of non-word characters. The textual data was further normalized through lowercasing, tokenization, stopword removal, and stemming.
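A sketch of this kind of regular-expression cleaning and normalization is shown below for the English case; the exact patterns and replacement tokens are assumptions, not the authors’ code.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_text(msg):
    # Replace e-mail addresses, URLs, currency symbols, phone numbers and
    # other numbers with generic tokens, then drop remaining non-word characters.
    msg = re.sub(r"\S+@\S+", " emailaddr ", msg)
    msg = re.sub(r"(http\S+|www\.\S+)", " httpaddr ", msg)
    msg = re.sub(r"[£$€]", " moneysymb ", msg)
    msg = re.sub(r"\b\d{7,}\b", " phonenumbr ", msg)
    msg = re.sub(r"\d+", " numbr ", msg)
    msg = re.sub(r"[^\w\s]", " ", msg)
    # Normalize: lowercase, tokenize, remove stopwords, stem.
    tokens = msg.lower().split()
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

# Example: clean_text("WIN £1000 now! Call 07712345678 or visit http://spam.example")
```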

For stopword removal, different tools were used depending on the dataset’s language. The Natural Language Toolkit (NLTK)35 was used for English (Datasets 1, 3, and 5) and Indonesian (Dataset 8). SnowballStemmer (TurkishStemmer)36 handled stopwords in Turkish (Dataset 2), while the Bengali Natural Language Processing Toolkit (bnlp_toolkit)37 was used for Bengali (Dataset 4). Hazm38 was applied for Persian (Dataset 7), and stopwords-iso39 was utilized for Hindi (Datasets 6, 9, and 10). Finally, the cleaned text data was vectorized using CountVectorizer, the target labels were encoded with LabelEncoder, and the dataset was split into training and testing sets with an 80:20 ratio. Table 2 maps the issues discussed in Section 3.3 and the preprocessing steps applied to each dataset, except Dataset 6, which does not contain any of the aforementioned issues.

Table 2 Issues and preprocessing steps applied to each dataset.
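The language-dependent stopword handling can be organized as a simple lookup before vectorization. The sketch below only includes lists that ship with NLTK (English, Turkish, Indonesian); Persian, Bengali, and Hindi stopwords would come from hazm, bnlp_toolkit, and stopwords-iso as listed above (those APIs are not reproduced here), and the dataset keys are illustrative.

```python
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Stopword lists per dataset; NLTK ships the three languages shown here.
STOPWORDS = {
    "dataset1": stopwords.words("english"),
    "dataset2": stopwords.words("turkish"),
    "dataset8": stopwords.words("indonesian"),
    # Dataset 4 (Bengali), Dataset 7 (Persian) and Datasets 6/9/10 (Hindi) would
    # use bnlp_toolkit, hazm and stopwords-iso respectively, as noted in the text.
}

def vectorize_and_split(df, dataset_id):
    stops = STOPWORDS.get(dataset_id, [])
    X = CountVectorizer(stop_words=stops).fit_transform(df["v2"])
    y = LabelEncoder().fit_transform(df["v1"])
    return train_test_split(X, y, test_size=0.2, random_state=42)
```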

While preprocessing is a crucial step in addressing data quality, it is not sufficient on its own. A more systematic approach to data evaluation and quality assessment is essential to enhance dataset usability and improve model performance. Drawing on established data quality assessment frameworks, the authors of22 outline a four-step process that includes assessing the dataset to understand its structure, content, and potential gaps; identifying data quality issues such as duplicates, inconsistencies, missing values, and inaccuracies; evaluating metadata completeness and accessibility using a five-star scale; and compiling a comprehensive report documenting the dataset’s quality, identified issues, and improvement recommendations for future researchers.

For intrusion detection system datasets, the authors of23 propose additional evaluation criteria to ensure high-quality datasets. These include having “complete traffic”, or a balanced representation of legitimate (ham) and spam messages, as well as a clearly labeled dataset where messages are accurately categorized as “spam” or “not spam.” Stratified sampling can be employed to ensure the dataset includes a proportional mix of spam and non-spam messages across different demographics, regions, and languages, so that the dataset is not skewed toward specific types of messages or user groups.

The authors also emphasize the importance of “attack diversity,” which refers to the inclusion of various spam types, such as phishing, promotional spam, scams, and malicious links. To enhance this diversity, diverse data collection sources should be leveraged, including multiple mobile carriers, regions, and user demographics. This is possible through collaborative data sharing via partnerships with mobile carriers, app developers, or organizations, which can provide access to a broader range of SMS data and further enrich the dataset’s heterogeneity. Techniques like federated learning can also be employed to train models on decentralized data sources without sharing raw data, ensuring privacy while maintaining diversity.

The authors also emphasized the importance of “heterogeneity” to ensure the dataset reflects global SMS traffic. This involves incorporating messages from different carriers, regions, and user demographics. However, to protect user privacy, privacy-preserving techniques such as anonymization, differential privacy, and tokenization should be applied. For example, personally identifiable information (PII) like names, phone numbers, and addresses can be obfuscated or replaced with placeholders, while still retaining the structure and meaning of the text.

To capture the full context of SMS exchanges, the dataset should exhibit “complete interaction,” including details such as sender, receiver, timestamps, and associated metadata. This metadata can be anonymized through the aforementioned privacy-preserving techniques to protect the user’s privacy while still providing valuable context for analysis. Finally, a “complete network configuration” is necessary, which involves incorporating a diverse set of SMS sources, such as personal, promotional, and transactional messages, to ensure the dataset accurately represents real-world scenarios. The integration of these criteria will ensure the dataset can better capture the complexity and variability of real-world SMS traffic, ultimately improving the performance and reliability of spam detection models while protecting the privacy of the user and increasing the research value of the dataset.

Model hyperparameter tuning

The current study employs two traditional machine learning models, Decision Tree (DT) and Multinomial Naïve Bayes (MNB). To optimize their performance, hyperparameter tuning was conducted using GridSearchCV with a five-fold cross-validation strategy. Both models were trained with a range of parameter values, and the best-performing hyperparameters were selected for each dataset. Table 3 provides an overview of the training hyperparameters used for each model.
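A minimal sketch of this tuning procedure is shown below; the parameter grids are illustrative placeholders (the grids actually searched are those reported in Table 3), and X_train/y_train are assumed to come from the 80:20 split described earlier.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

# Illustrative grids only; the grids actually searched are listed in Table 3.
search_spaces = [
    (DecisionTreeClassifier(random_state=42), {
        "criterion": ["gini", "entropy"],
        "max_depth": [None, 10, 20, 40],
        "min_samples_split": [2, 5, 10],
    }),
    (MultinomialNB(), {
        "alpha": [0.01, 0.1, 0.5, 1.0],
        "fit_prior": [True, False],
    }),
]

for model, grid in search_spaces:
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)  # five-fold cross-validation on the training split
    print(type(model).__name__, search.best_params_, round(search.best_score_, 4))
```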

Table 3 Hyperparameters used for each model.

Experimental result

The current research serves to evaluate ten publicly available SMS spam datasets using Multinomial Naïve Bayes and Decision Tree. The experimental results of this study fall into two groups: removal of English stopwords only and removal of the respective non-English language stopwords. Both sets of experimental results are presented in this section and are discussed further in Section V.

Baseline analysis

SMS Spam Detection with English Language Stopwords Removal.

Experimental result from decision tree

The goal of this analysis was to evaluate how effectively the Decision Tree model could classify SMS messages as spam or non-spam when English stopwords were removed from the datasets. Stopwords are common words that can introduce noise in text-based models, and their removal is a standard preprocessing step to improve classification accuracy. Table 4 summarizes the experimental results, while Fig. 5 visually compares the accuracy achieved across the datasets.

The Decision Tree demonstrates high accuracy on Dataset 1 (96.86%), Dataset 2 (98.28%), and Dataset 3 (98.35%). The precision scores for ‘ham’ range from 0.97 to 0.99, while for ‘spam’ they range from 0.94 to 0.98. The recall scores for ‘ham’ are 0.99, 0.98, and 0.99 for Dataset 1, Dataset 2, and Dataset 3, respectively. In contrast, the recall scores for ‘spam’ are 0.83, 0.99, and 0.94 for the respective datasets. The F1-score for ‘ham’ stands at 0.98 for Dataset 1 and Dataset 2 and increases to 0.99 for Dataset 3, whereas the F1-score for ‘spam’ varies between 0.89 and 0.98 across the datasets.

The Decision Tree exhibits moderate accuracy for Dataset 6, Dataset 7, Dataset 8, and Dataset 10. Among these, Dataset 6 achieves the highest accuracy at 92.50%, while Dataset 10 has the lowest at 86.49%. The recall scores for ‘ham’ range from 0.76 to 0.92, and for ‘spam’, from 0.91 to 0.95. Precision scores for ‘ham’ vary from 0.90 to 0.93, and for ‘spam’, from 0.83 to 0.92. The F1-scores for ‘ham’ range between 0.84 and 0.93, and for ‘spam’, between 0.88 and 0.92.

The datasets with the lowest accuracy in this group are Dataset 4, Dataset 5, and Dataset 9. Dataset 4 has the highest accuracy among these at 78.22%, while Dataset 9 has the lowest accuracy at 76.19%. The recall scores for ‘ham’ range from 0.64 to 0.87, and for ‘spam’, from 0.65 to 0.90. Precision scores for ‘ham’ span from 0.73 to 0.88, and for ‘spam’, from 0.69 to 0.82. The F1-scores for ‘ham’ vary between 0.74 and 0.81, and for ‘spam’, between 0.72 and 0.78.

Table 4 Comparison results of various evaluation metrics from Decision Tree with English language stopwords removal.
Fig. 5

The accuracy of Decision Tree with English language stopwords removal.

Results from multinomial naïve bayes

This analysis evaluates the performance of Multinomial Naïve Bayes (MNB) in classifying SMS spam messages with only English stopwords removed across all ten datasets; the results are presented in Table 5. Figure 6 shows the accuracy of the model on each dataset, revealing variation in performance across the datasets.

Datasets 1, 2, 3, and 6 achieved accuracy rates of 98.48%, 99.03%, 98.29%, and 96.00%, respectively. These datasets show strong performance across all metrics, with Dataset 2 notably achieving near-perfect scores in precision, recall, and F1-score for both ham and spam.

Datasets 7, 8, and 9 show accuracies of 93.76%, 95.20%, and 90.48%, respectively. These results indicate balanced performance, maintaining high precision and recall scores for both ham and spam, with Dataset 8 showing particularly robust precision and F1-scores for spam.

Datasets 4, 5, and 10 exhibit lower accuracy rates of 89.11%, 86.10%, and 83.78%, respectively. Despite the lower overall accuracy, some metrics such as precision for Ham in Dataset 5 and recall for Ham in Dataset 10 remain relatively high, suggesting that specific areas of performance are still strong.

These results underscore the effectiveness of the Multinomial Naïve Bayes classifier in handling SMS spam detection without the removal of non-English language stopwords, particularly on datasets where high accuracy is achieved. The variations in performance across different datasets highlight the influence of dataset characteristics on classifier efficacy, which is explored further in subsequent sections of the research.

Table 5 Comparison results of various evaluation metrics from Multinomial Naïve Bayes with English language stopwords removal.
Fig. 6

The accuracy of Multinomial Naïve Bayes with English language stopwords removal.

Enhanced analysis

SMS Spam Detection with Non-English language Stopwords Removal.

Experimental result from decision tree

In this study, ten publicly available SMS spam datasets are evaluated with DT and MNB, eight of which are either monolingually non-English (Dataset 2, Dataset 4, Dataset 7, and Dataset 8), transliterated non-English (Dataset 9), monolingually non-English combined with English (Dataset 3), or transliterated non-English combined with English (Dataset 6 and Dataset 10). The first group of experiments only tested the performance of DT and MNB when English stopwords were removed from all datasets.

In the second group of experiments, the SMS spam detection code was modified to remove the respective non-English language stopwords, one language at a time. The analysis of this second group discards Dataset 1 and Dataset 5, since both datasets use English. Additionally, Dataset 3 is also discarded because it is a multilingual dataset, which would require multiple stopword removal toolkits to remove German and French stopwords. The findings from Decision Tree in this group of experiments reveal differing levels of performance across the datasets, as shown in Table 6, while Fig. 7 shows the accuracy of this model on each dataset.

Datasets 2, 6, and 7 show promising results, achieving accuracy rates of 97.85%, 91.75%, and 91.33%, respectively. Dataset 2 shows particularly strong performance, with high precision, recall, and F1-scores for both ham and spam, achieving a precision of 0.99 for ham and 0.97 for spam, and an F1-score of 0.98 for both classes. Similarly, Datasets 6 and 7 exhibit balanced performance, with precision, recall, and F1-scores of 0.92 for both ham and spam, indicating consistent detection capabilities.

Datasets 4 and 8 achieve moderate accuracy rates of 81.19% and 88.21% respectively. Dataset 4 demonstrates a precision of 0.81 for ham and 0.82 for spam, with recall values of 0.88 for ham and 0.73 for spam, resulting in F1 scores of 0.84 for ham and 0.77 for spam. Dataset 8 shows a precision of 0.84 for ham and 0.92 for spam, with recall values of 0.91 for ham and 0.86 for spam, leading to F1 scores of 0.87 for ham and 0.89 for spam.

Datasets 9 and 10 show suboptimal results, with accuracy rates of 76.19% and 78.38%, respectively. Despite the lower overall accuracy, certain metrics remain noteworthy. For instance, Dataset 9 exhibits a high recall of 0.92 for spam, although its precision for ham is relatively low at 0.80, resulting in F1-scores of 0.62 for ham and 0.83 for spam. Dataset 10 shows a precision of 0.69 for ham and 0.86 for spam, with recall values of 0.79 for ham and 0.78 for spam, leading to F1-scores of 0.73 for ham and 0.82 for spam.

Table 6 Comparison results of various evaluation metrics from Decision Tree with non-English language stopwords removal.
Fig. 7

The accuracy of Decision Tree with non-English language stopwords removal.

Experimental result from multinomial naïve bayes

This analysis evaluates the performance of MNB in classifying SMS spam messages when the stopwords of each respective non-English language are removed. The experimental results of MNB, shown in Table 7, indicate that MNB outperformed DT across all non-English datasets. Figure 8 provides a visual illustration of MNB’s performance for this group of experiments.

Datasets 2, 6, and 8 display strong results, achieving accuracy rates of 98.28%, 96.75%, and 96.94%, respectively. Dataset 2 shows exceptional performance, with high precision, recall, and F1-scores for both ham and spam, achieving a precision of 0.99 for ham and 0.98 for spam, and an F1-score of 0.98 for both classes. Dataset 6 demonstrates balanced performance, with precision, recall, and F1-scores of 0.96 for ham and 0.97 for spam. Similarly, Dataset 8 achieves a precision of 0.99 for ham and 0.95 for spam, with recall values of 0.94 for ham and 0.99 for spam, resulting in F1-scores of 0.97 for both ham and spam.

In addition, Datasets 4, 7, and 9 show satisfactory results, with accuracy rates of 90.10%, 92.95%, and 90.48%, respectively. Dataset 4 shows a precision of 0.89 for ham and 0.93 for spam, with recall values of 0.95 for ham and 0.84 for spam, leading to F1-scores of 0.92 for ham and 0.88 for spam. Dataset 7 demonstrates balanced performance, with a precision of 0.94 for ham and 0.92 for spam, and recall values of 0.92 for ham and 0.94 for spam, resulting in F1-scores of 0.93 for both ham and spam. Dataset 9, while having a moderate accuracy rate, exhibits a precision of 0.80 for ham and 1.00 for spam, with recall values of 1.00 for ham and 0.85 for spam, leading to F1-scores of 0.89 for ham and 0.92 for spam.

Dataset 10 was found to exhibit the weakest performance in this group, with an accuracy rate of 86.49%. Despite the lower overall accuracy, certain metrics remain noteworthy: Dataset 10 shows a precision of 0.80 for ham and 0.91 for spam, with recall values of 0.86 for ham and 0.87 for spam, resulting in F1-scores of 0.83 for ham and 0.89 for spam.

Table 7 Comparison results of various evaluation metrics from Multinomial Naïve Bayes with non-English language stopwords removal.
Fig. 8

The accuracy of Multinomial Naïve Bayes with non-English language stopwords removal.

Discussion

This study serves to evaluate ten publicly available SMS spam datasets using Decision Tree (DT) and Multinomial Naïve Bayes (MNB). This section further discusses possible explanations for the results obtained in Section IV. The discussion is structured around a series of questions that were derived during the analysis of the results.

Comparative result of multinomial naive bayes and decision tree across different datasets

An overall analysis of the results from Decision Tree and Multinomial Naïve Bayes across both groups shows the superior performance of the Multinomial Naïve Bayes (MNB) model over the Decision Tree (DT) in detecting SMS spam texts, which can be attributed to the intrinsic characteristics of these algorithms.

MNB is exceptionally well-suited for text data due to its foundational assumptions of word independence and frequency40,41. This characteristic aligns seamlessly with the bag-of-words model commonly employed in text classification. The probabilistic framework of MNB, based on Bayes’ Theorem, enables it to manage word distribution in text data with high efficacy42. MNB calculates the probability of a message being spam based on word frequencies, assuming that the presence or absence of a particular word in a message is independent of any other word. While this assumption simplifies the modeling process and allows for efficient classification, it also means that MNB may struggle with capturing deeper contextual, non-linear relationships or complex dependencies between words, which could limit its effectiveness in datasets where spam messages exhibit more sophisticated linguistic structures.
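In symbols, the decision rule sketched here is the standard multinomial Naïve Bayes classifier (the notation below is ours, added for clarity):

$$\hat{c} = \arg\max_{c \in \{\text{ham},\,\text{spam}\}} P(c)\prod_{i=1}^{|V|} P(w_i \mid c)^{f_i}, \qquad P(w_i \mid c) = \frac{N_{ic} + \alpha}{N_c + \alpha\,|V|},$$

where $f_i$ is the frequency of word $w_i$ in the message, $N_{ic}$ is the count of $w_i$ in the training messages of class $c$, $N_c$ is the total word count in class $c$, $|V|$ is the vocabulary size, and $\alpha$ is the Laplace smoothing parameter (corresponding to MultinomialNB’s alpha parameter in scikit-learn).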

Conversely, Decision Trees (DT) are more apt for structured data characterized by explicit feature-value relationships. Unlike MNB, DT can model complex decision boundaries and capture nonlinear relationships in data, making them highly interpretable and adaptable to structured classification problems43. However, in high-dimensional textual data, the model’s reliance on discrete feature splits can lead to inefficiencies, particularly when words are sparsely distributed across messages. Effective application of DT to textual data necessitates intricate feature engineering, such as feature scaling, weighting, or dimensionality reduction, to mitigate the challenges posed by the high cardinality of unique words43.

MNB can effectively manage high-dimensional sparse data44, a common characteristic of text classification tasks. This is because MNB can deal with the sparsity of word occurrences across documents, and its probabilistic approach is robust in learning from word frequencies44. On the other hand, DT can struggle with sparse data because it has to split nodes based on the presence or absence of words, which often leads to overfitting and reduced generalization45,46. The tree structure in DT becomes too complex and specific to the training data when dealing with such high-dimensional sparse datasets47.

Regarding the bias-variance trade-off, MNB has high bias but low variance48, meaning that while it makes strong simplifying assumptions, it tends to generalize well across different datasets without drastic performance fluctuations. This robustness is particularly beneficial when working with datasets that contain noise or inconsistencies, such as Dataset 5, where extensive preprocessing is required. However, MNB’s inability to capture complex interactions between words may lead to performance limitations, particularly in cases where spam messages contain subtle linguistic variations or semantic patterns that require contextual understanding.

In contrast, DT tends to have lower bias but higher variance, making it more flexible in capturing complex decision boundaries but also more prone to overfitting, especially when applied to high-dimensional datasets49. Overfitting occurs when DT learns from noise or idiosyncratic patterns in the training data rather than capturing generalizable trends, leading to weaker performance on unseen data. This issue is particularly evident in datasets with high complexity, such as Dataset 5, where variations in spam message structures may cause DT to create overly specific decision rules that do not generalize well across different spam categories.

The superior performance of Multinomial Naive Bayes (MNB) over Decision Tree (DT) observed in this study aligns with findings from prior research. A comparative analysis of the models’ highest accuracies, as presented in Table 8, further corroborates these results in relation to previous studies.

Table 8 The superior accuracy of MNB over DT, as supported by prior studies.

Factors influencing performance variability in experimental runs

Upon a closer inspection of the model performance of DT and MNB, it can be concluded that both models exhibit different performance across all datasets used in this study. Performance variability refers to the differences in the experimental results obtained from each dataset, which can be influenced by factors such as dataset characteristics, class distribution, and the presence of noisy data.

The inherent complexity and structure of the data can significantly influence model performance. DT may encounter difficulties with datasets that exhibit complex and nuanced patterns, which probabilistic models like MNB are better equipped to handle. This is because DT tends to overfit on the intricate details of the training data, leading to varied performance when tested on different datasets52. In addition, language variations across datasets can influence model effectiveness. Each dataset representing different languages or transliterations can affect MNB’s performance based on linguistic features unique to each language52. Word frequencies, sentence structures, and common spam words vary across languages, contributing to performance variability52.

Class imbalance is another critical factor affecting performance variability. Variations in the distribution of classes within the training and testing sets can cause significant fluctuations53. The size, quality, and spam-to-nonspam ratio of each dataset profoundly impact MNB’s performance. Larger datasets generally provide a more robust learning base, while imbalanced datasets can skew results54. If a particular class is underrepresented in a split, the model may struggle to recognize it, thereby affecting overall performance.

The presence of random noise and outliers in the data can further contribute to performance variability, leading to inconsistent outcomes52,55. When noise affects different parts of the data during each run, the model’s training process can be disrupted in varying ways55. For example, datasets with high noise levels or irrelevant features may cause DT to grow overly complex, leading to poor generalization56.

Dataset characteristics significantly influenced the performance of both models. Dataset 2, which achieved the highest accuracy for MNB (99.03%), exhibited a nearly balanced class distribution and contained minimal noise, allowing MNB’s probabilistic approach to effectively distinguish between spam and non-spam messages. Similarly, Dataset 3, which recorded the highest accuracy for DT (98.35%), contained multilingual text, suggesting that DT was better able to leverage structured word patterns across different languages. However, MNB’s word independence assumption likely contributed to its slightly lower performance on this dataset.

In contrast, Dataset 5, which recorded one of the lowest accuracies for both models (MNB: 86.10%, DT: 76.55%), demonstrated high feature diversity and real-world complexity. The dataset’s high qualitative assessment score indicates a rich set of spam characteristics, yet the presence of noise, feature redundancy, and potential label inconsistencies may have contributed to lower classification performance. DT’s tendency to overfit was particularly evident in Dataset 5 and Dataset 9, both of which have high Gini coefficients (0.4999), indicating a more balanced distribution of spam and ham messages. However, the high feature complexity and presence of transliterated text in Dataset 9 negatively impacted both models, particularly DT, which struggled to create meaningful feature splits.

Assessment of datasets

A high-quality dataset can be evaluated quantitatively or qualitatively. The quantitative evaluation of the datasets used in this research has been thoroughly discussed in previous sections. These datasets were then subjected to a qualitative assessment to determine their reliability, effectiveness, and reusability, which encompasses several critical assessments: the authenticity of the source, class imbalance, the diversity of features, the availability of metadata, and the data preprocessing requirements. Each dataset is evaluated against five-level evaluation points for each assessment, which coincide with the five Likert-scale points defined in the respective tables.

A dataset retrieved or downloaded from an authentic source offers various benefits, including high accuracy and reliability, which reduces the likelihood of errors and inconsistencies during data analysis20. Additionally, authentic datasets are often accompanied by clear documentation that facilitates reusability, allowing other researchers to replicate the study and verify the results57. Furthermore, data from reliable sources typically exhibit consistency in format, structure, and quality, simplifying data preprocessing and analysis and reducing the need for extensive cleaning and transformation20. To assess the authenticity of the data source, this research followed the work of58 and constructed the five-level evaluation system below, which directly informs the qualitative assessment of each dataset’s source authenticity.

1. Is the dataset published in a peer-reviewed journal, a conference paper, or has it been used in a competition?

2. Is there clear documentation about how the data was collected?

3. Is there a clear history of the data, including any transformations or processing steps it has undergone?

4. Is the publication or release date of the dataset clearly stated?

5. Is the dataset internally consistent with no unexplained variations?

Table 9 shows the score assigned to each dataset for the authenticity of its source. The scores are assigned based on the evaluative points given above. For instance, Dataset 1 receives the highest Likert value of 5 because the dataset was published in a peer-reviewed journal, has clear documentation about how the data was collected, has a clear history, clearly states its release date, and is internally consistent with no unexplained variations.

Table 9 Scores assigned to each dataset for source authenticity.

The class distribution of a dataset is important as it can directly and indirectly influence the obtained results. A balanced class distribution improves the model’s performance and helps the model generalize better to new, unseen data, as it is less likely to be biased towards the more frequent classes59. On the other hand, an imbalanced class distribution results in poor model performance, especially because it causes low recall for minority classes and skewed accuracy60. To assess the balance of the class distribution within each dataset, this research proposes a five-level evaluation system to quantify the severity of class imbalance, inspired by the imbalance ratio (IR) discussed in prior work61,62. This framework allows for a more nuanced understanding of imbalance severity and informs the selection of appropriate techniques, such as threshold adjustment63, to improve model performance. The proportions of spam and non-spam instances are computed and converted into percentages. These percentages are then evaluated using a rating scale ranging from 1 to 5, as delineated by the following criteria (a minimal scoring sketch follows the list):

1. (Very Poor): The dataset is highly imbalanced, with one class comprising 90% of the data.

2. (Poor): The dataset is significantly imbalanced, with one class comprising 80% of the data.

3. (Fair): The dataset is moderately imbalanced, with one class comprising 70% of the data.

4. (Good): The dataset is slightly imbalanced, with one class comprising 60% of the data.

5. (Excellent): The dataset is perfectly or nearly perfectly balanced, with classes having equal or nearly equal representation.
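A minimal sketch that maps a dataset’s majority-class share onto this scale is given below; the thresholds are interpreted as lower bounds on the majority-class percentage, which is an assumption about the intended reading of the criteria.

```python
def imbalance_score(counts):
    """Map a two-class count list, e.g. [920, 80], onto the 1-5 Likert scale."""
    majority = max(counts) / sum(counts) * 100  # majority-class share in percent
    if majority >= 90:
        return 1  # very poor: highly imbalanced
    if majority >= 80:
        return 2  # poor
    if majority >= 70:
        return 3  # fair
    if majority >= 60:
        return 4  # good
    return 5      # excellent: balanced or nearly balanced

print(imbalance_score([920, 80]))   # 1, the majority class holds 92% of the data
print(imbalance_score([510, 490]))  # 5, a nearly balanced distribution
```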

Table 10 shows the score assigned to each dataset for class imbalance. The scores are assigned based on the five-level classification system above. For instance, Dataset 6 receives the highest Likert value of 5 because it has a perfectly balanced distribution of spam and non-spam messages.

Table 10 Scores assigned to each dataset for class imbalance.

The incorporation of diverse features in SMS spam detection research serves to enhance detection accuracy and the model’s robustness. Varied feature types capture distinct facets of spam messages, enabling the detection algorithm to make more nuanced decisions. This aligns with the findings of64, who noted that diversity in datasets helps models generalize their learnings to new and unseen cases. Similarly, incorporating diverse feature types ensures that the model can handle a wider range of spam characteristics, thus improving its robustness and adaptability. To measure feature diversity within datasets, this research proposes a five-level evaluation system that quantifies diversity purely based on the number of features present rather than the specific type of features. The rationale for this classification is as follows:

  • Datasets with fewer features (Scores 1–2) contain only basic text and labels, limiting their ability to provide meaningful distinctions between spam and ham messages.

  • Datasets with moderate features (Score 3) begin to introduce additional attributes, offering minor improvements to model learning.

  • Datasets with high feature diversity (Scores 4–5) provide richer insights by incorporating multiple attributes, which significantly enhance model performance and generalizability.

The five-level evaluation system is as follows:

  1. (Very poor): The dataset contains only two features: the raw text messages and their labels (e.g., spam/ham).

  2. (Poor): The dataset contains three features: raw text messages, labels, and one additional attribute.

  3. (Fair): The dataset contains four features: raw text messages, labels, and two additional attributes.

  4. (Good): The dataset contains five features: raw text messages, labels, and three additional attributes.

  5. (Excellent): The dataset contains six or more features, providing a diverse range of attributes that significantly improve spam detection.
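
Because this rubric depends only on the number of columns, it can be computed directly from a loaded dataset. The helper below is a minimal sketch; the CSV file name is hypothetical.

```python
import pandas as pd

def feature_diversity_score(df: pd.DataFrame) -> int:
    """Map the number of columns to the 1-5 feature-diversity scale described above."""
    n_features = df.shape[1]
    if n_features <= 2:
        return 1            # text + label only
    if n_features >= 6:
        return 5            # six or more features
    return n_features - 1   # 3 -> 2 (Poor), 4 -> 3 (Fair), 5 -> 4 (Good)

# Example (hypothetical file):
# print(feature_diversity_score(pd.read_csv("sms_dataset.csv")))
```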

Table 11 shows the score assigned to each dataset for feature diversity. The scores are assigned based on the five-level classification system above. For instance, Dataset 5 receives the highest Likert value of 5 because it contains five additional attributes in addition to the raw text messages and labels.

Table 11 Scores assigned to each dataset for feature diversity.

Metadata encompasses supplementary details associated with a text message beyond its actual content. These additional pieces of information furnish crucial contextual insights about the dataset, including its origin, purpose, and structural characteristics, which play a pivotal role in accurately interpreting the outcomes derived from the dataset22. Furthermore, robust metadata practices foster data sharing and collaborative endeavours among researchers by simplifying the comprehension and utilization of shared datasets58. The clarity inherent in meticulously documented metadata enhances communication and collaboration across disciplinary and institutional boundaries. To evaluate metadata availability, this research proposes a five-level evaluation system that assesses metadata richness based on the number of metadata fields present and their degree of exposure. The rationale for this classification is as follows:

  • Datasets with minimal metadata (Scores 1–2) provide little to no contextual information, reducing their applicability for advanced analysis.

  • Datasets with moderate metadata (Score 3) include some metadata fields but may have missing values or limited exposure.

  • Datasets with high metadata availability (Scores 4–5) provide structured, comprehensive metadata, improving interpretability and dataset usability.

The five-level evaluation system is as follows:

  1. (Very poor): The dataset contains only text messages and labels (spam/ham) with no additional metadata.

  2. (Poor): The dataset includes 1–2 metadata fields, but exposure is limited or inconsistent.

  3. (Fair): The dataset contains 3–4 metadata fields, offering some context but lacking full exposure.

  4. (Good): The dataset contains 5–6 metadata fields, with structured exposure of metadata across most records.

  5. (Excellent): The dataset contains 7 or more metadata fields, providing fully detailed and consistently structured metadata across all records.

Table 12 shows the score assigned to each dataset for metadata availability. The scores are assigned based on the five-level classification system above. For instance, Dataset 5 receives the highest Likert value of 5 because it provides fully detailed and consistently structured metadata across all records.

Table 12 Scores assigned to each dataset for metadata availability.

The assessment of data preprocessing concerns the extent of preparatory work needed to make the data suitable for model ingestion. This preparatory phase encompasses both data cleansing and integration procedures. The degree of preprocessing varies across datasets, with some requiring more extensive effort than others. Consequently, datasets requiring extensive preprocessing impose a higher computational burden, affecting the feasibility of research workflows.

To evaluate the effort required for data preprocessing, this research proposes a five-level evaluation system based on the complexity of the preprocessing tasks required (a sketch of typical preprocessing steps follows the list below). The rationale for this classification is as follows:

  • Datasets requiring minimal preprocessing (Scores 1–2) are well-structured and nearly ready for use, with only minor cleaning needed.

  • Datasets requiring moderate preprocessing (Score 3) contain minor inconsistencies that necessitate text normalization and label standardization.

  • Datasets requiring extensive preprocessing (Scores 4–5) are highly unstructured, with significant noise, missing values, and imbalanced data, requiring multiple preprocessing steps.

The five-level evaluation system is as follows:

  1. (Minimal): The dataset requires little or no preprocessing.

  2. (Low): The dataset requires minor formatting adjustments.

  3. (Moderate): The dataset requires the application of standard preprocessing steps.

  4. (High): The dataset requires significant preprocessing steps.

  5. (Extensive): The dataset requires multiple significant preprocessing steps.
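
As an illustration of the "standard preprocessing steps" associated with the Moderate level, the sketch below applies common text cleaning and label normalization. The `text` and `label` column names and the label vocabulary are assumptions for illustration, not properties of any specific dataset studied here.

```python
import re
import string
import pandas as pd

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation and digits, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply standard cleaning and label standardization to an SMS dataset."""
    df = df.dropna(subset=["text", "label"]).drop_duplicates()
    df["text"] = df["text"].astype(str).map(clean_text)
    df["label"] = df["label"].str.strip().str.lower().map({"ham": 0, "spam": 1})
    return df.dropna(subset=["label"])
```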

Table 13 shows the score assigned to each dataset for its data preprocessing requirements. The scores are assigned based on the five-level classification system above. For instance, Dataset 2 receives a Likert value of 2 because it requires only minor formatting adjustments.

Table 13 Scores assigned to each dataset for data preprocessing requirements.

Factors contributing to accuracy variations in Dataset 4 and Dataset 7 in both groups of experiments

This research employs ten publicly available SMS spam detection datasets. Among them, Dataset 2, Dataset 4, Dataset 7, and Dataset 8 are presented in their native scripts rather than being transliterated, unlike the other non-English datasets. An interesting observation arises when comparing the performance trend of DT and MNB between the first and second groups of experiments. For Dataset 2, Dataset 4, and Dataset 8, both DT and MNB improve in the second group of experiments relative to the first; for Dataset 7, however, only DT improves, while MNB declines. To ensure a smooth and logical discussion, Dataset 4 was randomly chosen from among the other monolingual non-English datasets to be compared against Dataset 7.

Table 14 shows the accuracy achieved by DT and MNB for Dataset 4 and Dataset 7 in both groups of the experiment. As shown in Table 14, the accuracy of both DT and MNB increases for Dataset 4 when moving from the first to the second group of experiments, whereas for Dataset 7 only the accuracy of DT increases while that of MNB decreases. The differences in performance between Dataset 4 and Dataset 7 across both groups can be attributed to several factors: the relevance of stopwords and the models' sensitivity to their removal.

Table 14 The accuracy of DT and MNB for Dataset 4 and Dataset 7 in both groups of the experiment.

The improvement observed in Dataset 4 when Bengali stopwords are removed can be attributed to the specific linguistic features of Bengali. The removal of Bengali stopwords likely reduced noise and irrelevant features, improving the quality of the features available to both models65. For other non-English datasets, transliteration issues and language-specific nuances could impede the effectiveness of stopword removal; inconsistent transliteration and variations in spelling can leave noise in the data, limiting the improvement in model performance66. Additionally, the quality of the stopword list plays a crucial role: according to64, an incomplete or inaccurate stopword list can limit the expected improvement in performance. In this study, both MNB and DT improved after Bengali stopwords were removed from Dataset 4, suggesting that the stopword list used was comprehensive. The clearer separation between spam and non-spam messages in Dataset 4 after stopword removal highlights the efficacy of this preprocessing step.

Conversely, the original language of Dataset 7 might rely heavily on stopwords to convey essential context. Removing these stopwords can disrupt the contextual integrity that MNB needs to perform effectively, as the model relies on word frequency distributions to make accurate predictions. In this case, stopwords carry significant meaning within the language structure, and their removal negatively impacts MNB's performance. DT, however, benefits from the removal of stopwords in Dataset 7, as it can better handle a reduced feature set by focusing on the remaining words; this suggests that the stopwords in Dataset 7 were adding unnecessary complexity and noise that hindered DT's decision-making process.

The sensitivity of MNB and DT to stopword removal also plays a crucial role in the observed accuracy variations. For MNB, performance on Dataset 4 improves with the removal of stopwords, as this reduces noise and enhances the signal, allowing the model to focus on more informative words. In contrast, for Dataset 7, the removal of stopwords disrupts the word-frequency-based probability estimates that MNB relies on, thereby reducing its accuracy. In Dataset 4, DT similarly benefits from reduced complexity and less noise, leading to clearer decision boundaries and improved accuracy. DT likewise shows an increase in accuracy when stopwords are removed from Dataset 7, indicating that the stopwords in this dataset were acting as noise and that their removal helped the DT model create more accurate splits.
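
The two experimental conditions discussed here can be reproduced for a single dataset along the lines of the hedged sketch below, which trains DT and MNB on the same bag-of-words features with either an English or a native-language (here Bengali) stopword list. The vectorizer, split, and hyperparameters are illustrative rather than those used in this study, and the Bengali list is assumed to be available from NLTK's stopwords corpus (present in recent releases).

```python
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

def run_group(texts, labels, stop_words=None):
    """Vectorize, split, and report test accuracy for MNB and DT."""
    X = CountVectorizer(stop_words=stop_words).fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=42)
    results = {}
    for name, model in (("MNB", MultinomialNB()),
                        ("DT", DecisionTreeClassifier(random_state=42))):
        results[name] = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    return results

# texts, labels = ...  # e.g., Dataset 4 messages and spam/ham labels
# group_1 = run_group(texts, labels, stopwords.words("english"))  # English stopwords only
# group_2 = run_group(texts, labels, stopwords.words("bengali"))  # native-language stopwords
```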

It is important to note that, although MNB's accuracy trend for Dataset 7 does not align with that observed for most other datasets when comparing the first and second groups of experiments (refer to Table 14), the difference in class balance between Dataset 4 and Dataset 7 could explain the observed accuracy differences. As shown in Table 10, Dataset 4, rated 4 on the Likert scale, is more imbalanced than Dataset 7, rated 5, so Dataset 4's accuracy is inflated by its class imbalance. This underscores the issue of skewed accuracy and highlights the critical role of stopwords in addressing such challenges.

Factors contributing to enhanced accuracy in Dataset 3 relative to other datasets

Dataset 3, which consists of SMS messages in English, German, and French, exhibited a unique performance trend in which the Decision Tree (DT) model outperformed Multinomial Naïve Bayes (MNB). This performance difference can be attributed to several factors, including the impact of the language feature independence assumption, handling of class imbalance, the incomplete stopword removal process in the first group of experiments, and overfitting and variance.

One key factor influencing the model’s performance is the language feature independence assumption inherent to MNB. This model assumes that word occurrences are independent, meaning that each word’s probability is calculated separately from others. While this assumption often works well for monolingual datasets, it becomes problematic in multilingual datasets like Dataset 3, where word meanings and distributions vary across different languages. For example, common spam-related words in English are expressed differently in French and German. Since MNB aggregates word frequencies across all three languages without distinguishing between them, it fails to recognize spam indicators effectively across multiple linguistic structures. Conversely, DT does not rely on the independence assumption and instead recursively splits the dataset based on the most informative features64,65. This flexibility allows DT to adapt to language-specific spam indicators, making it more effective in handling multilingual datasets like Dataset 3.

Another crucial factor affecting the models’ performance is class imbalance. Dataset 3 contains 2,241 spam messages compared to 14,460 non-spam messages, creating a significant imbalance that influences model learning, as reflected in its Likert rating in Table 10. As a probability-based classifier, MNB struggles with imbalanced data because its probability estimates are naturally skewed in favor of the majority class (non-spam messages); in other words, MNB may not handle class imbalance effectively unless specific techniques such as class weighting or resampling are employed44. As a result, spam messages are often misclassified due to their lower word frequencies. In contrast, DT is more resilient to class imbalance because it learns decision rules based on how well each feature (word) separates spam from non-spam. Instead of relying solely on word occurrence probabilities, DT dynamically adjusts its decision boundaries, allowing it to classify spam messages more effectively even when they are the minority64,65.
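
As one lightweight illustration of the class-weighting idea mentioned above, scikit-learn's MultinomialNB allows the class priors to be fixed rather than estimated from the imbalanced training data. This is a sketch of a possible mitigation, not the configuration used in this research.

```python
from sklearn.naive_bayes import MultinomialNB

# Default behaviour: priors are estimated from class frequencies,
# which favours the non-spam majority class on imbalanced data.
mnb_default = MultinomialNB()

# Uniform priors: both classes start on an equal footing, so predictions
# are driven by the word likelihoods rather than by the majority-class prior.
mnb_uniform = MultinomialNB(class_prior=[0.5, 0.5])
# mnb_uniform.fit(X_train, y_train)  # X_train: document-term matrix
```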

The stopword removal strategy in the first group of experiments also played a significant role in the models’ performance differences. In the first group, only English stopwords were removed, while German and French stopwords remained in the dataset. This had a disproportionate effect on MNB, which relies heavily on word frequency distributions: the presence of frequent yet uninformative German and French stopwords introduced noise into MNB’s probability calculations, reducing its ability to differentiate between spam and non-spam. Since MNB assigns equal importance to all words, these stopwords diluted the significance of actual spam-related terms, leading to lower classification accuracy. In contrast, DT naturally selects the most informative words through its recursive splitting process, so it was less affected by the unremoved stopwords and remained more robust despite the incomplete stopword removal.
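
The incomplete stopword removal described above could be addressed by combining the stopword lists of all three languages before vectorization, as in the hedged sketch below; the English, German, and French lists are taken from NLTK's stopwords corpus, and the vectorizer settings are illustrative.

```python
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Union of the English, German, and French stopword lists.
multilingual_stopwords = sorted(
    set(stopwords.words("english"))
    | set(stopwords.words("german"))
    | set(stopwords.words("french"))
)

vectorizer = CountVectorizer(stop_words=multilingual_stopwords, lowercase=True)
# X = vectorizer.fit_transform(texts)  # texts: English/German/French SMS messages
```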

Additionally, the consideration of overfitting and variance may provide further explanation. DT, while prone to overfitting, can perform exceptionally well, capturing patterns effectively without overfitting if the dataset is not too noisy67. In contrast, MNB, generally less prone to overfitting due to its simplicity, might ignore some intricate patterns that a DT could capture44.

Dataset recommendation

In the present research, ten SMS spam detection datasets were analyzed. Each dataset is characterized by distinct attributes, which influence the performance of the employed models: Decision Tree and Multinomial Naïve Bayes. The primary objective of this investigation is to formulate dataset recommendations predicated upon model performance, specifically the accuracy metrics generated by the models. Since MNB outperformed DT in most of the experiments, the dataset evaluations are based on MNB’s results to provide more reliable and accurate insights for future research.

The dataset recommendation is made based on the quantitative results (model accuracy) and the qualitative assessment. To facilitate the quantitative recommendation process, a set of grading criteria is introduced, contingent upon the accuracy levels attained by MNB. These criteria are stratified into three distinct categories: high accuracy (≥ 95%), moderate accuracy (90–94.99%), and low accuracy (< 90%). Given the absence of established industry benchmarks and of previous studies providing thresholds for SMS spam detection, the categorization in this study serves as a means of interpreting model performance across datasets. This exploratory approach may offer guidance for future research in developing more definitive benchmarks for SMS spam detection. Furthermore, given the delineation of the research into two experimental groups, the recommendations encompass MNB performance from both groups. The guidelines for categorizing each dataset as most challenging, moderately challenging, or least challenging are presented in Table 15. It is important to note that these criteria do not apply to Dataset 1, Dataset 3, and Dataset 5, as they were not included in the second experimental group; their categorization is therefore determined solely by MNB’s accuracy in the first group.
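
For clarity, the accuracy bands stated above map directly onto a simple categorization helper; the thresholds are those defined in this study rather than an established benchmark.

```python
def accuracy_category(accuracy_pct: float) -> str:
    """Assign an accuracy band using the thresholds defined in this study."""
    if accuracy_pct >= 95.0:
        return "high accuracy"
    if accuracy_pct >= 90.0:
        return "moderate accuracy"
    return "low accuracy"

# Example: accuracy_category(96.2) -> "high accuracy"
```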

In this research, a challenging dataset is one in which the models exhibit lower accuracy, not because of flaws, but due to its diverse spam patterns, real-world complexity, and feature richness. While noise and ambiguity may contribute to difficulty, such datasets encourage the development of more adaptable and generalizable models. Additionally, the recommended dataset is the one with the highest overall qualitative assessment score, ensuring it is well-documented, diverse, and beneficial for advancing spam detection research.

Recommending the most challenging dataset is beneficial as it highlights dataset complexity, thereby driving the development and refinement of more robust and sophisticated models. Furthermore, challenging datasets promote the advancement of algorithms capable of greater adaptability and resilience to diverse forms of noise and ambiguity. Additionally, recommending the dataset with the highest average score across qualitative factors improves research quality, model performance, and usability, while mitigating risks related to bias, data inconsistencies, and unnecessary complexity.

Table 15 The guideline of categorization criteria for each dataset.

Based on the aforementioned categories, the accuracy category for each dataset with the removal of English stopwords and with the removal of the respective non-English stopwords is summarized in Tables 16 and 17, respectively. Table 18 shows the overall challenge category for each dataset based on the delineated criteria.

Table 16 The category of challenges of each dataset with the removal of English language stopwords.
Table 17 The category of challenges of each dataset with the removal of the respective non-English language stopwords.
Table 18 Overall category of challenges for each dataset.

Based on Table 18, Datasets 1, 2, 3, 6, and 8 are identified as the least challenging for MNB, making them high-quality datasets suitable for baseline comparison studies or testing new models due to their consistent performance. In contrast, Datasets 7 and 9 present moderate challenges, making them useful for assessing model robustness and refining algorithms or feature engineering techniques. Datasets 4, 5, and 10 are the most challenging, offering valuable testbeds for developing and evaluating novel methodologies to enhance model performance under more complex conditions. These recommendations are grounded in the observed model performance.

By averaging the Likert scores assigned to each dataset in Tables 9, 10, 11, 12 and 13, an additional recommendation emerges based on the qualitative assessments: source authenticity, class imbalance, feature diversity, metadata availability, and data preprocessing requirements. As indicated in Table 19, Dataset 5 has the highest average score, making it the most recommended dataset. Given that Dataset 5 is also among the most challenging for MNB, it is strongly recommended for future SMS spam detection research, particularly for evaluating model performance in complex scenarios, followed by Dataset 6 and Dataset 8.

Table 19 Average value of each dataset based on the Likert values from Tables 9 to 13.

Since the recommended dataset, Dataset 5, has one of the highest average Likert values in Table 19, it can serve as an ideal testbed for driving algorithm development and enhancing the adaptability and robustness of SMS spam detection models. For example, because Dataset 5 did not score the highest value on the qualitative assessment of class imbalance, algorithm development on it calls for the integration of resampling techniques, such as SMOTE or undersampling, to ensure fair model training and evaluation (a minimal resampling pipeline sketch is given below). Future algorithms tuned to Dataset 5 will also have to account for inconsistencies introduced by integrating data from multiple sources, which requires models that are resilient to noisy, incomplete, and heterogeneous data.
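
A minimal sketch of such a resampling pipeline is shown below, assuming the imbalanced-learn package is available; the imblearn Pipeline applies SMOTE only during fitting, so the synthetic samples never leak into evaluation data. The vectorizer and model settings are illustrative.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),  # text -> sparse TF-IDF features
    ("smote", SMOTE(random_state=42)),           # oversample the minority (spam) class
    ("mnb", MultinomialNB()),
])
# pipeline.fit(train_texts, train_labels)   # e.g., Dataset 5 messages and labels
# predictions = pipeline.predict(test_texts)
```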

Leveraging Dataset 5 for model training encourages the development of more adaptable algorithms capable of handling diverse spam message structures. Feature engineering techniques, such as extracting semantic patterns, contextual embeddings, and n-grams, can be integrated during algorithm development to further enhance model effectiveness in identifying spam characteristics that may not be explicitly labeled. Additionally, the dataset’s challenging characteristics encourage experimentation with hybrid and ensemble learning approaches that improve model generalization, ensuring higher performance across different SMS datasets. Moreover, transfer learning can be explored by fine-tuning models trained on Dataset 5 and applying them to different datasets, reinforcing the model’s ability to generalize across various SMS spam detection tasks.
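
The n-gram feature engineering and ensemble ideas above could be combined along the lines of the following sketch, which feeds word and character n-gram TF-IDF features into a soft-voting ensemble of the two models studied here; all parameter choices are illustrative.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.tree import DecisionTreeClassifier

# Word unigrams/bigrams plus character n-grams, concatenated into one feature matrix.
features = make_union(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
)

ensemble = make_pipeline(
    features,
    VotingClassifier(
        estimators=[("mnb", MultinomialNB()),
                    ("dt", DecisionTreeClassifier(max_depth=30, random_state=42))],
        voting="soft",  # average predicted probabilities of MNB and DT
    ),
)
# ensemble.fit(train_texts, train_labels)
```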

Conclusion

This research investigates the efficacy of two machine learning models, Decision Tree and Multinomial Naïve Bayes, in detecting SMS spam messages across ten openly accessible datasets. Among these datasets, two are in English, while the remainder are in various non-English languages. Consequently, the investigation is divided into two experimental groups: the first involves removing only English stopwords, while the second entails removing the respective non-English stopwords from the non-English datasets. Across both experimental groups, Multinomial Naïve Bayes surpasses Decision Tree on most of the datasets. Additionally, in the second experimental group, Multinomial Naïve Bayes exhibits a decline in accuracy for Dataset 7, partly attributable to the fact that Datasets 4 and 7 are presented in their native languages, whereas the other non-English language datasets have undergone transliteration.

Each dataset underwent a qualitative evaluation on a Likert scale ranging from 1 to 5, based on diverse criteria including source authenticity, class imbalance, feature diversity, metadata availability, and data preprocessing requirements. Leveraging the accuracy outcomes of Multinomial Naïve Bayes, a categorization of dataset criteria was devised to recommend datasets for future SMS spam detection research. Datasets 1, 2, 3, 6, and 8 are classified as least challenging, Datasets 7 and 9 as moderately challenging, and Datasets 4, 5, and 10 as the most challenging. The latter are recommended for future research in SMS spam detection, as they have the potential to foster innovation and strengthen models’ ability to handle intricate datasets characterized by noise and ambiguity. By averaging the Likert values of each dataset, Dataset 5 was found to have the highest average qualitative score across all assessed criteria, rendering it the most recommended dataset for future SMS spam detection research. These findings can guide researchers in selecting appropriate datasets by emphasizing those that introduce sufficient complexity to enhance model robustness. Rather than choosing datasets that allow models to achieve artificially high accuracy, researchers should prioritize those that include variations in message structure, noise, and linguistic diversity. Additionally, datasets with rich metadata improve interpretability and reusability, making them more valuable for real-world applications.

However, this study has limitations. First, it focuses on a limited set of machine learning models (Decision Tree and Multinomial Naïve Bayes), which may not fully capture the potential of deep learning approaches for spam detection. Second, while the evaluation considered qualitative aspects of the datasets, it did not explore advanced data augmentation techniques that could mitigate class imbalance. Future research should investigate multilingual spam datasets and apply synthetic data generation or augmentation techniques to address dataset biases. Additionally, evaluating deep learning-based methods, such as transformer-based architectures, on these datasets could provide deeper insights into their effectiveness.