Abstract
SMS spam detection remains a critical challenge in mobile communication security, particularly when addressing the inherent class imbalance present in real-world datasets, where spam messages constitute only 13–15% of total communications. This study presents a comprehensive framework integrating advanced word embeddings, deep learning architectures, and Generative Adversarial Networks (GANs) for synthetic data augmentation to enhance SMS spam classification performance. A systematic evaluation is conducted across six machine learning algorithms (Support Vector Machine (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), Stochastic Gradient Descent (SGD), Random Forest (RF)) and two deep learning models (Long Short-Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM)), combined with five embedding techniques (Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BoW), Word2Vec, GloVe, Bidirectional Encoder Representations from Transformers (BERT)), resulting in 120 experimental configurations tested both with and without data augmentation. A novel GAN-based approach is employed to generate synthetic word embeddings rather than raw text, preserving semantic coherence while addressing dataset imbalance more effectively than traditional oversampling methods (Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN)). Experimental validation on both the monolingual UCI SMS Spam Collection and multilingual datasets demonstrates that the optimal BERT+Bi-LSTM+GAN configuration achieves exceptional performance, with F1-scores of 97.61% (monolingual) and 94.44% (multilingual), surpassing existing state-of-the-art approaches. The comprehensive evaluation framework, incorporating Matthews Correlation Coefficient (MCC) and Cohen’s Kappa (CK), provides robust assessment for imbalanced classification scenarios. 
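MCC and Cohen's kappa matter here because accuracy alone can look strong on a dataset where only about 14% of messages are spam. The following is a minimal sketch of both metrics computed from binary confusion counts. It is not the authors' evaluation code; the function names and the toy counts (100 messages, 14 of them spam) are illustrative assumptions.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of labels and predictions.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical toy case: 100 messages, 14 spam, classifier predicts "ham" for all.
all_ham = dict(tp=0, tn=86, fp=0, fn=14)
# Hypothetical toy case: every message classified correctly.
perfect = dict(tp=14, tn=86, fp=0, fn=0)

if __name__ == "__main__":
    for name, counts in (("all-ham", all_ham), ("perfect", perfect)):
        print(name, accuracy(**counts), mcc(**counts), cohen_kappa(**counts))
```

In the all-ham case accuracy is 0.86 while both MCC and kappa are 0, which is why chance-corrected metrics are preferred for imbalanced classification.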
Results reveal that contextual embeddings consistently outperform traditional frequency-based methods, with BERT reaching 100% precision in baseline configurations. The study establishes strategic deployment guidelines: BERT configurations where maximum accuracy is required, Word2Vec approaches where performance and efficiency must be balanced, and traditional methods for resource-constrained environments. Cross-linguistic validation confirms that the approach generalizes, with only a 3.25% relative drop in F1-score (from 97.61% to 94.44%) in multilingual contexts. This research advances both the theoretical understanding of imbalanced text classification and the practical implementation of robust SMS spam detection systems, providing a methodological foundation applicable to broader cybersecurity and natural language processing challenges.
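The 3.25% figure matches the relative drop between the two F1-scores reported in the abstract. Assuming that is how it was computed (the abstract does not say so explicitly), the arithmetic is:

```latex
\[
\frac{F1_{\text{mono}} - F1_{\text{multi}}}{F1_{\text{mono}}}
= \frac{97.61 - 94.44}{97.61}
= \frac{3.17}{97.61}
\approx 0.0325 = 3.25\%
\]
```

In absolute terms the gap is 3.17 percentage points.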
Data availability
The datasets analyzed during the current study are publicly available:
- UCI SMS Spam Collection dataset: https://archive.ics.uci.edu/dataset/228/sms+spam+collection
- Multilingual SMS Spam dataset: https://www.kaggle.com/datasets/rajnathpatel/multilingual-spam-data
Abbreviations
- Acc: Accuracy
- ADASYN: Adaptive synthetic sampling approach for imbalanced learning
- AUC: Area under the receiver operating characteristic curve
- BCE: Binary cross-entropy loss
- BERT: Bidirectional encoder representations from transformers
- Bi-LSTM: Bidirectional long short-term memory network
- BoW: Bag of words
- CBOW: Continuous bag of words
- CK: Cohen’s kappa coefficient
- DT: Decision tree
- F1: F1-score
- FPR: False positive rate
- GAN: Generative adversarial network
- GloVe: Global vectors for word representation
- KNN: K-nearest neighbors
- LSTM: Long short-term memory network
- LR: Logistic regression
- MCC: Matthews correlation coefficient
- OTP: One-time password
- RF: Random forest
- ROC: Receiver operating characteristic curve
- SD: Standard deviation
- SGD: Stochastic gradient descent
- SMS: Short message service
- SMOTE: Synthetic minority over-sampling technique
- TF-IDF: Term frequency–inverse document frequency
- TT: Training time
- TTUR: Two time-scale update rule
- WGAN: Wasserstein GAN
- Word2Vec: Word-to-vector embedding model
Acknowledgements
The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-17).
Funding
This research was funded by Taif University, Taif, Saudi Arabia, project number (TU-DSPP-2024-17).
Author information
Contributions
A.F: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Visualization, Project administration. M.S: Methodology, Software, Validation, Writing - review & editing, Supervision. E.A: Conceptualization, Resources, Writing - review & editing, Supervision, Project administration, Writing - original draft. M.M: Software, Validation, Investigation, Data curation, Writing - original draft. M.E: Methodology, Formal analysis, Writing - review & editing. A.B: Software, Validation, Data curation, Writing - review & editing. R.A: Resources, Writing - review & editing. A.Y: Validation, Writing - review & editing, Supervision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Filali, A., Shorfuzzaman, M., Abdellaoui Alaoui, E. et al. Cross-lingual SMS spam detection using GAN-based augmentation for imbalanced datasets. Sci Rep (2026). https://doi.org/10.1038/s41598-026-37769-4
DOI: https://doi.org/10.1038/s41598-026-37769-4