Table 1 Overview of the datasets, including the class distribution, Gini coefficient, language used, and source.
From: Key insights into recommended SMS spam detection datasets
Dataset | Citations | Class distribution | Gini coefficient | Language | Source |
---|---|---|---|---|---|
1 | | Spam – 747; Ham – 4825 | 0.2179 | English | |
2 | | Spam – 2523; Ham – 2128 | 0.4998 | Turkish | |
3 | | Spam – 2241; Ham – 14,460 | 0.2911 | English, German and French | |
4 | Not accessible | Spam – 217; Ham – 286 | 0.4998 | Bengali | Not accessible |
5 | | Spam – 1571; Ham – 2456 | 0.4999 | English | |
6 | | Spam – 1000; Ham – 1000 | 0.5000 | English and Hindi (Transliterated) | |
7 | | Spam – 2130; Ham – 2193 | 0.4999 | Persian | |
8 | | Spam – 574; Ham – 569 | 0.5000 | Indonesian | https://www.kaggle.com/code/gevabriel/indonesian-sms-spam-detection-using-indobert/input |
9 | | Spam – 74; Ham – 30 | 0.4999 | Hindi (Transliterated) | https://github.com/paulpriyam/spamTransliteration/tree/master |
10 | | Spam – 107; Ham – 77 | 0.4999 | English and Hindi (Transliterated) | |