Table 1 Freely available Arabic datasets.
From: Open source Arabic research paper dataset for natural language processing
Dataset | Domain/source | # Documents | # Categories |
---|---|---|---|
Khaleej-2004 | News articles | 5,690 | 4 |
Arab news (BBC) | News articles | 4,763 | 7 |
Arab news (CNN) | News articles | 5,070 | 6 |
OSAC | News articles | 22,429 | 10 |
KACST | Saudi Press Agency (SPA) | 1,526 | 6 |
KACST | Saudi News Papers (SNP) | 4,842 | 7 |
KACST | Websites | 2,170 | 7 |
KACST | Writers | 821 | 10 |
KACST | Discussion forums | 4,107 | 7 |
KACST | Islamic topics | 2,243 | 5 |
KACST | Arabic poems | 1,949 | 6 |
TALAA corpus | Daily Arabic newspaper websites | 57,827 | 8 |
ANT corpus | Arab news websites | 31,798 | 6 |
NADA | News articles | 7,310 | 10 |
SANAD | News articles | 190k | 7 |
ArCLIS corpus | Journal articles in Library and Information Science | 674 | 1 |
ARPD corpus | Journal articles from different domains | 2,011 | 7 |