Table 1 Overview of the datasets, including the class distribution, Gini coefficient, language, and source.

From: Key insights into recommended SMS spam detection datasets

| Dataset | Citations | Class distribution | Gini coefficient | Language | Source |
|---|---|---|---|---|---|
| 1 | 26 | Spam – 747; Ham – 4825 | 0.2179 | English | https://archive.ics.uci.edu/dataset/228/sms+spam+collection |
| 2 | 27 | Spam – 2523; Ham – 2128 | 0.4998 | Turkish | https://github.com/onrkrsy/TurkishSMS-Collection |
| 3 | 28 | Spam – 2241; Ham – 14,460 | 0.2911 | English, German and French | https://www.kaggle.com/datasets/debapampal2002/sms-dataset1 |
| 4 | Not accessible | Spam – 217; Ham – 286 | 0.4998 | Bengali | Not accessible |
| 5 | 29 | Spam – 1571; Ham – 2456 | 0.4999 | English | https://github.com/AbayomiAlli/SMS-Spam-Dataset |
| 6 | 30 | Spam – 1000; Ham – 1000 | 0.5000 | English and Hindi (transliterated) | https://github.com/princebari/-SMS-Spam-Classification-on-Indian-Dataset-A-Crowdsourced-Collection-of-Hindi-and-English-Messages/blob/main/README.md |
| 7 | 31 | Spam – 2130; Ham – 2193 | 0.4999 | Persian | https://zenodo.org/records/7832188 |
| 8 | 32 | Spam – 574; Ham – 569 | 0.5000 | Indonesian | https://www.kaggle.com/code/gevabriel/indonesian-sms-spam-detection-using-indobert/input |
| 9 | 33 | Spam – 74; Ham – 30 | 0.4999 | Hindi (transliterated) | https://github.com/paulpriyam/spamTransliteration/tree/master |
| 10 | 34 | Spam – 107; Ham – 77 | 0.4999 | English and Hindi (transliterated) | https://www.kaggle.com/datasets/uds5501/sms-dataset/data |