Table 1 Freely available Arabic datasets

From: Open source Arabic research paper dataset for natural language processing

| Dataset | Reference | Domain/source | No. of documents | No. of categories |
| --- | --- | --- | --- | --- |
| Khaleej-2004 | 8 | News articles | 5690 | 4 |
| Arab news (BBC) | 9 | News articles | 4763 | 7 |
| Arab news (CNN) | 9 | News articles | 5070 | 6 |
| OSAC | 9 | News articles | 22,429 | 10 |
| KACST | 10 | Saudi Press Agency (SPA) | 1526 | 6 |
| | | Saudi News Papers (SNP) | 4842 | 7 |
| | | Websites | 2170 | 7 |
| | | Writers | 821 | 10 |
| | | Discussion forums | 4107 | 7 |
| | | Islamic topics | 2243 | 5 |
| | | Arabic poems | 1949 | 6 |
| TALAA corpus | 11 | Daily Arabic newspaper websites | 57,827 | 8 |
| ANT corpus | 12 | Arab news websites | 31,798 | 6 |
| NADA | 13 | News articles | 7310 | 10 |
| SANAD | 15 | News articles | 190k | 7 |
| ArCLIS corpus | 2 | Library and Information Science journal articles | 674 | 1 |
| ARPD corpus | 16 | Journal articles from different domains | 2011 | 7 |