Abstract
Recent advancements in applications such as natural language processing (NLP), applied linguistics, indexing, data mining, information retrieval, and machine translation have emphasized the need for robust datasets and corpora. While many Arabic corpora exist, most are derived from social media platforms like X or from news sources, leaving a significant gap in datasets tailored to academic research. To address this gap, the Arabic Research Papers Dataset (ARPD) was developed as a specialized resource for Arabic academic research papers. This paper explains the methodology used to construct the dataset, which consists of seven classes and is publicly available in several formats to benefit Arabic research. Experiments conducted on the ARPD dataset demonstrate its performance in classification and clustering tasks. The results show that most classical clustering algorithms achieve low performance compared with bio-inspired algorithms such as Particle Swarm Optimization (PSO) and Gray Wolf Optimization (GWO), based on the Davies–Bouldin index. For classification, the Support Vector Machine (SVM) algorithm outperformed the others, achieving the highest accuracy, with classifier accuracies overall ranging from 89% to 99%. These findings highlight the ARPD’s potential to enhance Arabic academic research and support advanced NLP applications.
Introduction
Corpora are essential tools that provide the authentic texts and data required for a wide range of applications, including NLP, applied linguistics, indexing, data mining, information retrieval, and machine translation. These areas and tools have experienced exponential growth worldwide since the 19th century1,2. However, the Arab world has not been able to derive maximum benefit from these tools because of the lack of specialized corpora and datasets for various domains and fields, particularly in academic areas. As noted by Eddakrouri2 and Mansour3, the lack of extensive resources is a significant challenge that hinders the advancement of Arabic language research.
The Arabic language is severely underrepresented in the available corpora, as reported by Eddakrouri4 and Guellil et al.5. This indicates that Arabic requires more attention and that some domains need dedicated datasets to help researchers produce valuable work and applications. Furthermore, most corpora developers have focused on the language of the media (e.g., online newspapers, magazines, and news wire agencies) rather than on other domains such as academic texts. Therefore, there is an urgent need to develop specialized corpora for many domains, especially academic fields, to meet the diverse needs of the Arab world. Such resources will enable researchers, linguists, and other specialists to conduct in-depth research, enhance their understanding of Arabic, and leverage technological advancements to achieve their goals.
In addition, the availability of Arabic documents in electronic form, especially academic papers, has increased significantly in recent years, as noted by Eddakrouri2 and Sahmoudi et al.6. In the Middle East, collective regional publication output has grown substantially over the past four decades, from a mere 7,665 research papers in 1981 to 150,000 documents in 2019, a remarkable roughly 20-fold increase amid the expansion of global production. Some of these papers are written in Arabic. This burgeoning academic production in the Arab world has sparked research and analytical interest in organizing and studying it in all its aspects, especially linguistically2.
Academic and scientific studies have unique characteristics: they are marked by artistic quality, excellent production standards, and a high scientific level2. Therefore, it is necessary to provide specialized data for academic papers so that researchers can apply their applications and algorithms to data with similar characteristics, instead of building a dataset from scratch, which is time-consuming. Researchers can benefit from public datasets or use them as a basis for further research.
This study addresses this gap by developing an open, free-source Arabic corpus designed for empirical research. The availability of an Arabic dataset of academic research papers is crucial to supporting the Arabic research community in improving quality, applying algorithms, and testing methodologies on Arabic academic data. Existing Arabic datasets in the literature focus on news or tweets. The ARPD dataset of academic Arabic research papers is therefore proposed and built to support text clustering, classification, and other NLP applications that deal with academic papers. This dataset is intended to strengthen research across the Arabic research community.
The ARPD dataset is a publicly available academic resource, specifically designed as a representative single-labeled Arabic dataset suitable for text classification, clustering, and other NLP tasks. It encompasses seven classes that have been carefully selected to eliminate ambiguity and make the dataset more robust for accurate text classification, clustering, and related NLP tasks. Additionally, the ARPD dataset caters to a wide range of research needs and consists of four distinct versions, described in the “Final output” section. Finally, the ARPD dataset is accompanied by extensive experimental results that evaluate and characterize clustering and classification performance.
The remainder of this paper is organized as follows. The “Related work” section surveys existing Arabic corpora. The “Methodology” section outlines the methodology used to construct the dataset. The “Dataset description” section describes the proposed dataset in detail. The “Experiments” section outlines the experiments conducted to evaluate the dataset. The “Results and discussion” section presents and discusses the experimental results. Finally, the “Conclusion” section concludes the paper.
Related work
Several Arabic datasets and corpora are available in bibliographic databases and references4, serving as valuable resources for researchers seeking to enhance their work. Detailed descriptions of these datasets are provided in the following paragraphs and summarized in Table 1. Several commonly used datasets are extracted from well-known news websites. One such dataset is Khaleej-20047, which consists of articles and news items as detailed by Abbas et al.8.
In addition, Saad et al.9 introduced three datasets: BBC, CNN, and OSAC. The BBC and CNN datasets contain Arabic news collected from bbcarabic.com and cnnarabic.com, respectively. Both datasets are available in text and ARFF file formats. Furthermore, the Open-Source Arabic Corpora (OSAC)9 comprises news articles and serves as a freely accessible corpus aimed at supporting research in Arabic linguistics. OSAC is one of the most comprehensive open-source Arabic corpora, containing three datasets that span diverse genres and subject domains. The corpus is available in text file format.
The Arabic text corpus known as the King Abdulaziz City for Science and Technology (KACST) Arabic corpus is a substantial and diverse collection. Its design criteria were well defined, and the corpus was carefully sampled. The content is classified according to various criteria: time, region, medium, domain, and topic. This enables users to search and explore the corpus easily10.
The TALAA corpus, introduced by Selab et al.11, is a collection of Arabic newspaper articles from various websites. Each article category is saved in a separate text file. A portion of the TALAA corpus was tagged to create an annotated Arabic corpus of approximately 7,000 tokens labeled with parts of speech (POS). According to Guellil et al.5, the TALAA dataset is one of the most extensive Arabic corpora constructed from daily Arabic news articles. In addition, ANTCorpus12, short for Arabic News Texts Corpus, is a research project that incrementally collects textual data from various web sources. The corpus files are formatted in XML.
Alalyani et al.13 proposed a new Arabic dataset called NADA, which is intended for text categorization. This corpus combines two existing corpora, OSAC and the Diab Dataset (DAA)14, and contains news articles. The new corpus was pre-processed and filtered using state-of-the-art methods. It was also organized according to the Dewey Decimal Classification scheme and balanced using the synthetic minority oversampling technique. Finally, the corpus is provided as three files: an Attribute-Relation File Format (ARFF) file, classified text, and sample data files.
SANAD15 is an extensive dataset of Arabic news articles collected from three popular news portals: Al-Khaleej, Al-Arabiya, and Akhbarona. This single-labeled dataset has immense potential for use in various Arabic NLP tasks, including text classification and word embedding. Articles were gathered using Python scripts specifically designed for each news portal. Once collected, the text files were saved in folders, each corresponding to a specific category. In 2023, Eddakrouri2 introduced the Arabic Corpus for Library and Information Science (ArCLIS), a specialized dataset in the field of Library and Information Science (LIS). This corpus contains 674 text files comprising scientific articles collected from various LIS journals. The files are available online.
In comparison with previous studies, the ARPD dataset contributes to the Arabic language field by providing a novel academic dataset that can be utilized in various applications, such as NLP models and text analysis. Unlike datasets that focus on a single domain, such as ArCLIS2, the ARPD dataset contains papers from seven scientific fields written in Arabic: Arabic language, religion, art, law, education, business, and agriculture. The novelty of this research lies in several aspects. First, we created a new dataset of academic literature consisting of seven classes covering different fields of science. Second, the generated dataset has a significant size of 2,011 documents. In addition, ARPD is published in different formats, such as PDF, text, and CSV, to benefit Arabic research, as described in the “Final output” section. Finally, the dataset features are described in the “Dataset description” section.
Methodology
The dataset was built using a multiphase methodology, and the diagram shown in Figure 1 illustrates the main steps. These involve determining the number of classes, selecting suitable journals, and collecting PDF files from the chosen journals. The collected files were assessed against specific criteria, such as the entire file being written in Arabic rather than only the title and abstract. Additionally, the files should not be scanned images, and each file's abstract was read to confirm that the paper is related to the corresponding dataset class. The accepted files were converted to text files to facilitate their use in NLP processing. Although many files were converted correctly, some errors had to be fixed manually. Certain files were removed because they had issues that could not be resolved manually, such as text in which all letters were connected without spaces or that contained non-Arabic symbols. Consequently, only the correctly processed files were cleaned and prepared for further use. Finally, the processed files were saved in various formats. These phases are described in detail in the following subsections.
Journals selection
The number of classes was initially selected on the basis of related studies, in which the average number of classes in the available datasets was seven. The authors also searched for well-known, publicly available journals in Arabic to inform class selection. Based on this knowledge, seven classes were selected: Art, Law, Business, Religion, Agriculture, Arabic, and Education. These classes were selected based on the significant colleges of reputable universities in leading Arab countries where the Arabic language is prominently taught. Further selection criteria included considering influential academic journals that regularly publish articles in Arabic and verifying that these journals focus, in most cases, on topics related to the chosen classes. We also ensured that the selected classes were distinct.
The journals included in this document collection are of great importance; they were carefully selected because of their high quality and dependability. The selection process followed the PRISMA guidelines provided by Page et al.17. Figure 2 shows a flowchart explaining the main processes applied in this study. A comprehensive list of the selected journals for each category is presented in Table 2.
The primary sources for the selected journals were affiliated with esteemed academic institutions in Saudi Arabia, Egypt, Jordan, Syria, Palestine, and the United Arab Emirates. The institutions included King Saud University, Egypt University, Jordan University, Syrian University, Palestinian University, Imam University, Umm Al-Qura University, and Sharjah University.
Finally, legal approval or consent from the authors and researchers to use the copyrighted papers is not mandatory, because the law allows access to journal papers without referring back to their authors or requesting their permission. In addition, all papers in this dataset are open access and available online. With respect to copyright, all papers are included as their original PDF files.
Moreover, several exceptions reported in copyright laws, as discussed by Geiger et al.18, apply here: using a publication for purely educational purposes, as is the case for the ARPD dataset; creating a copy of a publication for non-commercial personal use; and performing data mining on publications to digitize and index their contents so that specialized software can process the texts.
Data collection
PDF files were manually obtained from the selected journals for each category with careful consideration, given that some journals publish in both Arabic and English, while others publish only the title and abstract in Arabic and the remaining parts in English. It was also imperative to ensure that the published versions were not scanned. Sometimes, all volumes or issues are published in a single file, making it necessary to split out each study independently; for this, we used the iLovePDF tool, which is available online.
In addition, to ensure that the papers did not overlap and could be accurately classified, it was necessary to manually scrutinize their titles and abstracts. Finally, each PDF file was saved in the corresponding category on the drive, and the files were renamed with numbers starting from zero for each category. The number of papers collected from all journals that met the criteria is depicted in Figure 3 and Table 3.
Preprocessing steps
After collecting the PDF files (Fig. 1), the next step was to convert them to text files using a Python script based on the PyPDF2 library and save the output in a separate folder. After this step, some files were converted correctly, whereas others were converted with errors. The text files were therefore checked manually, and it was decided whether each error could be corrected or whether the file was corrupted beyond repair and needed to be removed.
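As an illustration only, this conversion step can be sketched as follows, assuming a recent PyPDF2 release (the 3.x API with PdfReader); the per-category folder names are placeholders, not the exact paths used in this study.

from pathlib import Path
from PyPDF2 import PdfReader

def convert_folder(pdf_dir: str, txt_dir: str) -> None:
    # Convert every PDF in pdf_dir to a UTF-8 text file with the same stem in txt_dir.
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        (out / (pdf_path.stem + ".txt")).write_text(text, encoding="utf-8")

# Example: convert_folder("pdf/Law", "text/Law"), repeated for each category folder.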
In some situations, after manual checking, a text file had to be deleted because it could not be corrected, and the related PDF file was deleted as well. Some of these errors were related to Unicode, with the text converted into unknown characters. In addition, some text files were converted into Arabic, but certain letters were repeated many times and were difficult to fix. Moreover, in many cases the Arabic text was converted into Farsi (Persian) letter forms; although the content can be understood, such files are difficult or impossible to fix. Finally, some files were converted into Arabic without spaces. Table 4 lists the number of files deleted and removed after this step, while Table 5 lists the number of accepted files.
However, some errors could be fixed and corrected manually or by applying a small code script to the text file. In some files, the problem appeared only in the titles, abstracts, headings, or subheadings; these errors were fixed manually by rewriting the affected text based on the PDF file, as some of these problems were related to the font style and type used. Other errors concerned only the order of certain Arabic letters, so the wrong form had to be replaced with the correct one using a code script or manually, such as converting ( اال ، األ ، اإل ، اآل، إال ، آال، أال ) to ( الا ). Moreover, repeated letters in some files had to be removed manually or through Python code. Finally, after correction, all files were saved on the drive.
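A hedged sketch of such a repair pass is shown below; the replacement pair and the repetition threshold are illustrative assumptions, since the exact fixes depended on the font artefacts observed in each file.

import re

# Illustrative mapping of extraction artefacts to their corrected forms (extend as needed).
REPLACEMENTS = {
    "اال": "الا",
}

def repair(text: str) -> str:
    for wrong, right in REPLACEMENTS.items():
        text = text.replace(wrong, right)
    # Collapse any character repeated three or more times in a row to a single occurrence;
    # legitimate doubled letters are left untouched.
    return re.sub(r"(.)\1{2,}", r"\1", text)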
The final output
The dataset was built and made available online so that other researchers can use and benefit from it. The dataset, hosted on Zenodo together with accompanying materials, allows researchers to exploit academic data in their research, proposed applications, and algorithms. The dataset contains all PDF files classified into the appropriate classes. In addition, it contains the text files obtained after converting and processing the PDFs.
Moreover, the dataset contains another set of text files as a new version produced after applying additional processing steps. This preprocessing includes Arabic normalization for alef, teh, and ligature using PyArabic in Python; it also removes tashkeel, harakat, tatweel, and shadda. In addition, stop-word removal based on the lists of Namly et al.19 and Alrefaie20 was applied as a preprocessing step.
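A minimal sketch of this released preprocessing step is given below, assuming the pyarabic library; the explicit alef/teh mapping and the stop-word handling are simplifications of the lists from Namly et al.19 and Alrefaie20.

from pyarabic import araby

def preprocess(text: str, stopwords: set) -> str:
    text = araby.strip_tashkeel(text)      # strip Arabic diacritics (tashkeel)
    text = araby.strip_tatweel(text)       # remove the elongation (tatweel) character
    text = araby.normalize_ligature(text)  # normalize lam-alef ligatures
    # Explicit alef and teh-marbuta normalization to keep the sketch self-contained.
    text = text.translate(str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ة": "ه"}))
    tokens = [t for t in araby.tokenize(text) if t not in stopwords]
    return " ".join(tokens)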
Finally, the dataset is also saved as CSV files for researchers who require that format. Each CSV file contains two columns: the first holds the paper text and the second the category (class) corresponding to the paper. Thus, there are two CSV files: one without preprocessing and one containing the preprocessed text. The dataset is publicly available in the Zenodo repository16.
Dataset description
The data are organized into three folders: the first contains the PDF files, with each category saved in a separate subfolder inside the PDF folder; the second contains the corresponding text files converted from the PDF files; and the last contains the text files after preprocessing, namely Arabic normalization, tashkeel removal, and stop-word removal. The dataset is also saved as CSV files in the Zenodo repository16.
The dataset contains seven main classes: law, business, religion, agriculture, Arabic, art, and education. Table 6 and Fig. 4 present the percentage of each category in the dataset. Figure 3 shows the total number of papers collected from the journals, while Fig. 5 indicates the number of papers accepted after the cleaning process and used to build the dataset in each category. The dataset contains 2,011 papers with different features. Table 7 lists the number of features for the various versions and situations; the second value represents the number of unique terms (words or tokens) identified by the TF-IDF vectorizer across the entire document set. It illustrates the effect of preprocessing on the number of features and how feature reduction is achieved. This table provides valuable insight into the effectiveness and quality of the ARPD cleaning process. Finally, Table 8 and Fig. 6 show the word distribution for each category, namely the total words, distinct words, most frequent words and their frequency, rare words, and average document length.
Experiments
The dataset was thoroughly tested and validated using various clustering algorithms and four distinct evaluation measures. In addition, several classifier algorithms were assessed and evaluated to demonstrate the performance of the proposed dataset. The following section provides a detailed explanation of the algorithms and evaluation measures used in the experiments, along with the primary settings for each algorithm.
Clustering algorithms
The dataset was evaluated and tested using several clustering algorithms, including traditional and bio-inspired algorithms. The traditional algorithms include K-means, density-based spatial clustering of applications with noise (DBSCAN), and mini-batch K-means. Recently, bio-inspired algorithms have attracted significant attention in this field of research; therefore, two such algorithms, PSO and GWO, were also used to evaluate the dataset.
The K-means clustering algorithm is a widely used technique for clustering data. It partitions data points into k clusters based on their similarity distances and is well suited to data that forms roughly spherical groups. To run the K-means algorithm, the value of k must be predetermined. The algorithm then assigns each data point to the nearest cluster center based on its similarity distance and repeats this process iteratively until it reaches an acceptable level of convergence at which no further changes in the results are observed21.
Mini-batch K-means is a variant of the K-means clustering technique that aims to improve computational efficiency. Unlike K-means, which considers all data points in the dataset in every iteration of the clustering process, mini-batch K-means selects a random subset of data points (a “batch”) to perform the distance calculations. The method requires the k and batch-size values to be set in advance. This approach substantially reduces the time required for clustering and improves the overall convergence rate, making it effective in various scenarios, including large-scale datasets with millions of data points22.
DBSCAN is a clustering algorithm that effectively groups data points in high-dimensional, spatial, and nonspatial databases, even in the presence of noise. The fundamental principle of DBSCAN is to identify objects that belong to a cluster based on the proximity of nearby objects within a specified radius (eps) and the minimum number of objects required to form a cluster (MinPts). This approach assumes that nearby objects are more likely to belong to the same cluster than those farther apart. As a result, the algorithm efficiently clusters datasets with diverse shapes and structures, making it a valuable tool for various research applications23.
PSO is a population-based stochastic search process inspired by the social behavior of bird flocks. In this evolutionary algorithm, a group of particles forms a swarm, where each particle represents a potential solution to an optimization problem and attempts to find the best position. The goal of PSO is to determine the particle position leading to the best evaluation of a given fitness (objective) function. This algorithm has been used in various applications, such as data analysis and clustering24,25.
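Purely to illustrate the mechanism (this is not the implementation used in the experiments, which relied on the Pyriad library), a minimal NumPy sketch of PSO-based clustering is given below, where X is a dense feature matrix, each particle encodes k candidate centroids, and all parameter values are assumptions.

import numpy as np

def pso_cluster(X, k, n_particles=30, n_iter=100, w=0.72, c1=1.49, c2=1.49, seed=0):
    # Each particle is a set of k candidate centroids; fitness is the mean distance
    # of every point to its nearest centroid (lower is better).
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    pos = X[rng.integers(0, n, size=(n_particles, k))].astype(float)  # (particles, k, dims)
    vel = np.zeros_like(pos)

    def fitness(centroids):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return dists.min(axis=1).mean()

    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()

    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmin()].copy()

    labels = np.linalg.norm(X[:, None, :] - gbest[None, :, :], axis=2).argmin(axis=1)
    return gbest, labels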
GWO is a meta-heuristic algorithm inspired by the leadership hierarchy of gray wolves and their unique hunting mechanism. This algorithm adopts a population-based approach motivated by the natural behavior of gray wolves during prey hunting. In the GWO algorithm, the search for prey is led by a leader whose movement is influenced by the position of the prey, i.e., the current optimal solution, for global exploration, while followers perform local exploration to balance exploitation and exploration capabilities. Because the GWO algorithm has a simple structure, few control parameters, and high convergence accuracy, it reduces the probability of falling into local optima26,27.
Clustering evaluation metrics
The clustering metrics and measures used to assess the results were the Silhouette Coefficient, the Adjusted Rand Index (ARI), and the Davies–Bouldin index.
The Silhouette Coefficient28 is a valuable metric for evaluating the efficacy of clustering techniques and can be used to assess the separation between clusters. By measuring the proximity of each point within a cluster to points in neighboring clusters, the silhouette plot provides insight into the distance between clusters. The resulting silhouette score ranges from -1 to 1, with a score of +1 indicating that the sample is significantly distant from its neighboring clusters. A score of zero indicates that the sample lies on or close to the decision boundary between two neighboring clusters, whereas negative scores suggest that the sample may have been assigned to the wrong cluster. In the experiments, the Silhouette Coefficient was computed with both the Euclidean and cosine distance measures28. The equation for the Silhouette Coefficient is given in Eq. 1.
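In its standard form, the silhouette coefficient of a point \(i\) is \(s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}\) (Eq. 1),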
where \(a(i)\) is the average distance between \(i\) and all other points in the same cluster, and \(b(i)\) is the average distance between \(i\) and all points in the nearest neighboring cluster.
The Adjusted Rand Index (ARI)29 is a commonly used metric in cluster analysis that measures the degree to which two partitions of the data agree. It is a corrected-for-chance version of the Rand index, often used to compare two groupings of a given set of objects, and is among the most widely employed similarity measures. This measure determines the similarity between two clusterings by evaluating all pairs of samples and counting pairs assigned to the same or different clusters in both the predicted and actual clustering. A higher value indicates a more substantial degree of similarity between the two clusterings29. The equation for the ARI is given in Eq. 2.
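In its standard form, \(\mathrm{ARI} = \dfrac{\sum_{ij}\binom{n_{ij}}{2} - \Big[\sum_i \binom{x(i)}{2}\sum_j \binom{y(j)}{2}\Big]\big/\binom{n}{2}}{\tfrac{1}{2}\Big[\sum_i \binom{x(i)}{2} + \sum_j \binom{y(j)}{2}\Big] - \Big[\sum_i \binom{x(i)}{2}\sum_j \binom{y(j)}{2}\Big]\big/\binom{n}{2}}\) (Eq. 2),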
where \(n\) is the total number of samples, \(n_{ij}\) is the number of points shared between the \(i\)-th predicted cluster and the \(j\)-th true cluster, \(x(i)\) is the total number of points in the \(i\)-th predicted cluster, and \(y(j)\) is the total number of points in the \(j\)-th true cluster.
The Davies–Bouldin index30 is a metric for evaluating clustering algorithms. It is an internal evaluation scheme that verifies clustering quality using quantities and features inherent to the dataset. A lower value of this index indicates better aggregation. This well-known metric is based on the idea that groups should be well separated from each other while being internally homogeneous and compact30. The equation for the Davies–Bouldin index is given in Eq. 3.
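In its standard form, \(DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \ne i}\frac{\sigma_i + \sigma_j}{d_{ij}}\) (Eq. 3),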
where \(k\) is the number of clusters. \(\sigma _i\) is the average distance between each point in cluster \(i\) and the centroid of cluster \(i\) (intra-cluster scatter). \(d_{ij}\) is the distance between the centroids of clusters \(i\) and \(j\) (inter-cluster distance).
Classification algorithms
The main classification algorithms used to test the proposed dataset include a decision tree (DT), SVM, naive Bayes (NB), and random forest (RF).
DT is a technique used in text and data mining for classification and has been successfully applied to a diverse range of fields. A decision tree is a hierarchical breakdown of the data space, with each level representing a classification decision. The main challenge with this technique is determining which attributes or features should be placed at the parent and child levels, as highlighted by Kowsari et al.31.
Another ensemble learning method used for text classification is the random forest, which utilizes decision trees. In this approach, multiple decision trees are randomly generated and the final classification is determined by the collective output of these trees. For more detailed information on this algorithm, refer to Wu et al.32 and Kowsari et al.31.
SVMs are supervised learning models initially designed for binary classification tasks. These models are instrumental in solving significant high-dimensional classification problems. The SVM aims to determine the optimal hyperplane that separates two different classes of data, and it can be applied to document classification tasks31,33.
The NB classifier is a family of simple probabilistic classifiers based on the common assumption that all features are independent. The naive Bayes method is theoretically based on the Bayes theorem formulated by Thomas Bayes. This approach has been widely used for document categorization31,34.
Classification evaluation metrics
The classification metrics and measures used to assess the results were accuracy, F-score, recall, and precision. Accuracy is one of the most commonly used measures of classification performance and is defined as the ratio of correctly classified samples to the total number of samples35. Recall represents the proportion of positive samples that are correctly classified out of the total number of positive samples, whereas precision represents the proportion of correctly classified positive samples out of the total number of positively predicted samples, as indicated by Tharwat35 and Ferri et al.36. The F-measure, or F-score, is the harmonic mean of precision and recall; it ranges from zero to one, with high values indicating high classification performance35.
Experiment setup
This experiment employed clustering and classification techniques in Python, implemented using Google Colab Pro+, to test and validate the proposed dataset. A series of experiments was conducted to evaluate the performance of these methods. The first configuration removed stop words and the diacritics tashkeel, harakat, tatweel, and shadda; this lightly preprocessed version of the dataset is referred to as Preprocessing 1. The second set of preprocessing steps, known as Preprocessing 2, included Arabic normalization and the removal of tashkeel, harakat, tatweel, and shadda; it additionally involved pruning terms that appeared fewer than three times and eliminating stop words. Preprocessing 3 included Arabic normalization, the removal of tashkeel, harakat, tatweel, and shadda, stemming, pruning of terms appearing fewer than three times, and stop-word removal. Finally, in all scenarios the results were converted into vectors using the term frequency-inverse document frequency (TF-IDF) method.
The PyArabic and scikit-learn Python libraries were used to perform the preprocessing and stemming steps, and scikit-learn was also used for clustering and classification. In addition, the Pyriad library in Python was used to run the bio-inspired algorithms, namely PSO and GWO. The experiments comprised the K-means and mini-batch K-means methods, with k values ranging from 2 to 10. The DBSCAN method used a minimum sample size of five and was tested with five different eps settings: 0.3, 0.5, 0.7, 0.9, and 1. Finally, the PSO method used a population of 30 particles, iteration counts of 100, 50, and 30, and k values of 4, 6, and 5. The k values were selected based on the best results for K-means (k = 6), the elbow experiment (k = 4), and the silhouette score (k = 5).
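As a hedged illustration of how the traditional clustering runs and their metrics fit together in scikit-learn (the file name and column names below are assumptions, not the exact ones shipped with ARPD):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

df = pd.read_csv("arpd_preprocessed.csv")              # hypothetical file name
X = TfidfVectorizer().fit_transform(df["text"])        # hypothetical column names
y_true = df["class"]

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # davies_bouldin_score needs a dense array; a TruncatedSVD reduction could be used instead.
    print(k,
          round(silhouette_score(X, labels, metric="cosine"), 3),
          round(davies_bouldin_score(X.toarray(), labels), 3),
          round(adjusted_rand_score(y_true, labels), 3))

for eps in (0.3, 0.5, 0.7, 0.9, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    print(eps, round(adjusted_rand_score(y_true, labels), 3))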
Finally, to apply the classification experiment, it is paramount to have access to training and testing data. To achieve this, the entire dataset must be converted into a TF-IDF matrix. Data were divided into two partitions using the percentage method. The first partition consisted of training data, which comprised 70% of the dataset, whereas the second partition consisted of test data, which accounted for 30%. Notably, this step is crucial to ensure the accuracy and reliability of the classification experiment because it allows the model to be trained on a subset of the data and then tested on a separate subset to evaluate its effectiveness.
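A minimal sketch of this split-and-evaluate step follows, with SVM as the example classifier (LinearSVC is one common SVM implementation; the file and column names are again assumptions):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

df = pd.read_csv("arpd_preprocessed.csv")
X = TfidfVectorizer().fit_transform(df["text"])
y = df["class"]

# 70/30 split of the TF-IDF matrix into training and test partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = LinearSVC().fit(X_train, y_train)
# Reports per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_test, clf.predict(X_test)))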
Results and discussion
Clustering results
The traditional clustering algorithms were first applied to the proposed dataset with a specific preprocessing configuration: stop-word removal and the removal of tashkeel, harakat, tatweel, and shadda (Preprocessing 1). The resulting data were converted into a TF-IDF matrix. The K-means and DBSCAN algorithms were then applied, and the results are tabulated for easy reference: the results of the K-means algorithm are listed in Table 9, while those of the DBSCAN algorithm are given in Table 10. Finally, the mini-batch K-means algorithm was applied, and the results are presented in Table 11.
According to the Davies–Bouldin index, a lower value indicates better clustering, as this metric serves as an internal measure for clusters. In this experiment, DBSCAN (eps = 0.7) and K-means (k = 10) yielded the best values of 1.545 and 1.866, respectively, for the Davies–Bouldin index. Conversely, the least optimal values of 13.591 and 5.550 were obtained with DBSCAN (eps = 1) and K-means (k = 6), respectively. In addition, the Silhouette Coefficient is an internal measure for assessing clustering quality; it ranges from \(-1\) to 1, with higher values indicating better clustering results. In our experiment, the K-means algorithm with k = 7 produced the highest Silhouette Coefficient, a value of 0.097 with cosine as the distance function. However, this value remains relatively small, indicating weak clusters. Finally, DBSCAN yielded mostly negative Silhouette Coefficient scores with the Euclidean distance, except at eps = 0.7; although the cosine distance gave positive results in most cases, the highest value was 0.032. External validation using the ARI is widely recognized as a reliable technique for measuring the similarity between two clusterings. The highest value obtained in the present study (0.419) was achieved by K-means at k = 9, indicating substantial agreement between the predicted clusters and the reference classes.
In the second scenario, additional preprocessing steps were applied (Preprocessing 2). These steps included removing stop words, tashkeel, harakat, tatweel, and shadda, and performing Arabic normalization for alef, teh, and ligature using PyArabic. The output was then converted into a TF-IDF matrix and tested using the traditional algorithms. The results of the K-means algorithm can be found in Table 12, the results of the DBSCAN algorithm are shown in Table 13, and Table 14 shows the results of the mini-batch K-means algorithm.
Based on the internal Davies–Bouldin index, the mini-batch K-means algorithm performed well for most k values, with scores of around 2.5 or above, whereas the overall best and worst results were obtained by the DBSCAN algorithm at eps = 0.7 and eps = 1, with scores of 1.672 and 13.1, respectively. The K-means algorithm yielded scores between 3.5 and 5.5. Regarding the Silhouette Coefficient, the highest score was approximately 0.1 for K-means when using cosine as the distance function. DBSCAN mostly yielded negative scores with the Euclidean distance, except at eps = 1; although the cosine distance gave positive results in most cases, the highest value was 0.035.
Regarding the external measure, the highest ARI score of 0.49 was obtained by K-means at k = 9 or 10. For DBSCAN, the score was 0.009, and for mini-batch K-means, it was 0.3.
Several preprocessing steps were applied in the final experiment, including stop-word removal; removal of tashkeel, harakat, tatweel, and shadda; and Arabic normalization for alef, teh, and ligature using PyArabic. The text was then stemmed using Arlstem and further filtered to exclude terms with fewer than three characters (Preprocessing 3). The resulting text was converted into a TF-IDF matrix. Table 15 shows the results for the K-means algorithm, Table 16 shows the results of the DBSCAN algorithm, and Table 17 shows the results for the mini-batch K-means algorithm. Additionally, the PSO and GWO algorithms were applied, because the reduced number of features obtained after all these preprocessing steps makes the data suitable for these types of algorithms; their results are listed in Tables 18 and 19.
Cluster quality in this case was assessed using the internal measures described above, namely the Davies–Bouldin index and the Silhouette Coefficient. According to the Davies–Bouldin index, a lower value indicates better clustering. In this experiment, DBSCAN and mini-batch K-means yielded the best values of 1.6 and 2.2, respectively, while the K-means algorithm typically yielded values of around 4. Conversely, the least optimal value of 8.6 was also obtained with DBSCAN. Based on the experimental results, the bio-inspired GWO achieved the best results compared with the traditional algorithms, scoring between 1 and 2, while the worst score obtained by the bio-inspired algorithms was around 5 for PSO.
Moreover, the Silhouette Coefficient is an internal measure for cluster quality assessment, ranging from \(-1\) to 1, with higher values indicating better clustering results. In our experiment, the K-means algorithm with k = 5 and k = 6 produced the highest Silhouette Coefficient (cosine) value of 0.1, while the best result for mini-batch K-means was 0.09 at k = 2. However, these values remain relatively small, indicating weak clusters. In addition, PSO obtained Silhouette Coefficient scores of around 0.04 to 0.05 when using the cosine distance for k = 4, 5, and 6, whereas GWO received negative scores in most cases. Finally, DBSCAN yielded mostly negative Silhouette Coefficient scores, whether the Euclidean or the cosine distance was used.
According to the ARI, a higher value indicates more substantial agreement between the clustering result and the reference classes. In this study, the K-means clustering algorithm with k = 9 produced the highest ARI of 0.5, whereas the mini-batch K-means clustering algorithm with k = 10 resulted in an ARI of 0.43. Notably, the DBSCAN clustering algorithm exhibited an ARI of less than 0.016, indicating almost no agreement with the reference classes. Furthermore, the GWO algorithm produced a low ARI of less than 0.032, whereas the PSO algorithm produced values of around 0.3 in most cases.
These results imply that the K-means clustering algorithm with k = 7 was the most effective method for clustering the dataset used in this study. In contrast, the mini-batch K-means clustering algorithm with k = 5 and the DBSCAN clustering algorithm are unsuitable. Moreover, the GWO algorithm may not provide satisfactory results in this particular scenario, although PSO achieves better ARI and Silhouette Coefficient values than GWO. These findings offer valuable insight into the performance of various clustering algorithms on this specific dataset and can aid in selecting an appropriate algorithm for similar datasets.
Classification results
In the first classification experiment (Preprocessing 1), stop words were removed and diacritics (tashkeel, harakat, tatweel, and shadda) were eliminated. The data were then transformed into a TF-IDF matrix. The output of the DT algorithm is presented in Table 20, the SVM and NB results are displayed in Tables 21 and 22, respectively, and Table 23 lists the RF results. The accuracy of all algorithms is generally at or above 90%. Our findings indicate that SVM and RF demonstrated the highest accuracy rates, with SVM achieving 98% and RF achieving 97%. In contrast, DT and NB showed the lowest accuracy rates, at 90% and 92%, respectively.
Moreover, RF yielded the best recall score for the Arabic category, while SVM achieved the best recall score for the law category, with a perfect score of 100%. Additionally, the precision measure was greater than or equal to 94% for all classes, indicating that the documents assigned to each category were largely classified correctly. Finally, the F-score results show that SVM outperformed all other algorithms, achieving an F-score of 99% for the art, education, and law classes. Conversely, the lowest F-score was 84%, for the Arabic category when running DT, indicating that the other algorithms performed better in terms of F-score.
In the second scenario, additional preprocessing steps were applied (Preprocessing 2). These included removing stop words as well as tashkeel, harakat, tatweel, and shadda; Arabic normalization was also performed for alef, teh, and ligatures using PyArabic. The resulting output was then converted into a TF-IDF matrix. The DT algorithm results are listed in Table 24, and Table 25 displays the results of the SVM algorithm. Table 26 shows the results of the NB algorithm, and Table 27 shows the outcomes of the RF algorithm.
The experiments showed that SVM and RF demonstrated the highest accuracy rates, at 98% for both, whereas DT and NB exhibited the lowest accuracy rates, at 90% and 91%, respectively. Furthermore, the best recall score was achieved by SVM for the law class, at 100%, and SVM and RF reached 99% for the agriculture and business classes.
The precision measure was greater than or equal to 83% for all classes. A high F-score of 99% was achieved when SVM was run for the business, education, and law classes, while the lowest F-score was 81%, for the Arabic category when running DT. Finally, the precision measure was 100% for the education category with SVM and RF, meaning that positive predictions for the education category were correct in most classifiers; the art class with SVM also achieved 100% precision.
Finally, several preprocessing steps were applied (Preprocessing 3), including stop-word removal; removal of tashkeel, harakat, tatweel, and shadda; Arabic normalization for alef, teh, and ligature using PyArabic; stemming using Arlstem; and pruning of terms appearing fewer than three times. The result was then converted into a TF-IDF matrix. For the DT algorithm, the results are shown in Table 28, Table 29 shows the results of the SVM algorithm, and Table 30 shows the results for the NB algorithm. Finally, Table 31 presents the results for RF.
In these experiments, SVM and RF demonstrated the highest accuracy rates, at 99% and 98%, respectively, whereas NB and DT exhibited the lowest accuracy rates, at 89% and 90%, respectively. Furthermore, SVM achieved the best recall score for art and education, with a perfect score of 100%, and RF also received a recall score of 100% for the art class.
Moreover, the F-score for the art class was 100% with SVM, and the agriculture and education classes reached 99% with both SVM and RF. DT recorded its top F-score in the law category, at 97%, and NB reached 96% for education. Finally, the precision measure was 100% for the art category with both SVM and RF, and SVM also scored 100% precision for the education category. In general, the results of this experiment show that classifier performance improves as more preprocessing is applied.
Discussion
The performance of the proposed dataset was evaluated based on the outcomes of the cluster analysis and classification described in this section, and these findings provide the basis for the key conclusions of this study. The SVM algorithm was well suited to classifying the ARPD dataset, achieving an overall accuracy of approximately 98% in all scenarios; its F-score, recall, and precision were likewise approximately 98% across all situations and classes. The precision scores for the education category were exceptionally high, reaching 100% in several classifiers and remaining above 90% in most experimental scenarios. This indicates that documents assigned to the education category were almost always correctly classified as belonging to the education domain. Additionally, the SVM algorithm excelled at identifying law documents, achieving a recall of 100%, which reflects its ability to retrieve all relevant documents for the law category, and the RF classifier performed similarly well for the art category. Overall, the accuracy of the classifiers ranged between 89% and 99%, with precision consistently exceeding 83% across all scenarios, and recall and F-score values were consistently around 97%, highlighting the strong overall performance of the classification algorithms.
The experimental analysis of clustering indicated weak performance based on the Silhouette Coefficient, with a maximum value of 0.1 or less. This low score is due to the dimensionality of the data: achieving higher values becomes difficult because of the curse of dimensionality. The results indicate that the cluster separation quality is subpar, requiring further investigation to improve clustering performance. The analysis also compared traditional clustering methods with bio-inspired algorithms and showed that bio-inspired methods are more effective than traditional methods based on the Davies–Bouldin index, which measures the separation between clusters and the variation within clusters. Among the bio-inspired methods, GWO achieved the best results in the experiments, scoring between 1 and 2, whereas PSO performed worse, scoring around 5. However, bio-inspired methods may not always be the best option, because they can yield poor results on metrics such as the ARI and the Silhouette Coefficient, especially GWO; PSO, by contrast, consistently yields ARI values of around 0.3.
In the context of clustering, the optimal results were obtained when the value of k differed from 7, the number of categories present in the corpus. This discrepancy can be attributed to the characteristics of the dataset and the behavior of the clustering algorithms. Although the ARPD dataset is categorized into seven distinct domains, the clustering algorithms operate on the inherent features of the data and may therefore identify patterns based on similarities within the feature space that do not align directly with the established categories.
In conclusion, our study offers valuable insight into the performance of clustering and classification algorithms on this specific dataset, which can help researchers select appropriate algorithms for similar datasets. In addition, this study provides evidence of the utility of ARPD in NLP applications, especially those focusing on academic data. To the best of our knowledge, the ARPD dataset is the first Arabic dataset that focuses on academic papers with multiple classes.
Conclusion
This study introduced ARPD, a novel dataset of Arabic academic research papers. The dataset was meticulously built by gathering and refining research papers so that researchers can readily use and benefit from it. By addressing the scarcity of academic datasets, ARPD aims to advance Arabic academic text clustering, classification, and other relevant NLP applications.
The ARPD dataset comprises seven classes: Arabic, Law, Education, Religion, Art, Agriculture, and Business. It is provided in multiple versions and formats, including text files that have undergone preprocessing and filtering steps to reduce the efforts required by researchers to rebuild Arabic corpora. The dataset is publicly available on the Zenodo platform in three formats: CSV file, raw text, and PDF. Along with additional materials, this dataset assists researchers in utilizing academic resources to develop academic tools, propose applications, and design algorithms.
The dataset was rigorously assessed and validated using numerous clustering and classification algorithms. Four evaluation measures were used to assess classification quality, while the clustering experiments were evaluated using measures of both the internal and external quality of the clusters. The experimental results indicate that ARPD is a proficient dataset for Arabic classification purposes and can also be exploited for clustering. Moreover, the dataset can be expanded by adding new classes and documents to widen its applicability, particularly for Big Data and Deep Learning.
Data availability
The dataset generated and analyzed during the current study is publicly available in the Zenodo repository16. Link: https://doi.org/10.5281/zenodo.15547781
References
Arts, T., Belinkov, Y., Habash, N., Kilgarriff, A. & Suchomel, V. artenten: Arabic corpus and word sketches. J. King Saud Univ.-Comput. Inf. Sci. 26, 357–371 (2014).
Eddakrouri, A. Arabic corpus of library and information science: Design and construction. Egypt. J. Lang. Eng. 10, 1–9 (2023).
Mansour, M. A. The absence of Arabic corpus linguistics: A call for creating an Arabic national corpus. Int. J. Hum. Soc. Sci. 3, 81–90 (2013).
Ahmed, A. et al. Freely available Arabic corpora: A scoping review. Comput. Methods Prog. Biomed. Update 2, 100049 (2022).
Guellil, I., Saâdane, H., Azouaou, F., Gueni, B. & Nouvel, D. Arabic natural language processing: An overview. J. King Saud Univ.-Comput. Inf. Sci. 33, 497–507 (2021).
Sahmoudi, I. & Lachkar, A. Formal concept analysis for Arabic web search results clustering. J. King Saud Univ.-Comput. Inf. Sci. 29, 196–203 (2017).
Abbas, M. & Smaili, K. Comparison of topic identification methods for Arabic language. In Proceedings of International Conference on Recent Advances in Natural Language Processing, RANLP. 14–17 (2005).
Abbas, M. & Smaili, K. Comparison of topic identification methods for Arabic language. In Proceedings of International Conference on Recent Advances in Natural Language Processing, RANLP. 14–17 (2005).
Saad, M. K. & Ashour, W. Osac: Open source Arabic corpora. In 6th ArchEng International Symposiums, EEECS. Vol. 10. 55 (2010).
Al-Thubaity, A. O. A 700m+ Arabic corpus: Kacst Arabic corpus design and construction. Lang. Resour. Eval. 49, 721–751 (2015).
Selab, E. & Guessoum, A. Building Talaa, a free general and categorized Arabic corpus. ICAART 1, 284–291 (2015).
Chouigui, A., Khiroun, O. B. & Elayeb, B. Ant corpus: An Arabic news text collection for textual classification. In 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA). 135–142 (IEEE, 2017).
Alalyani, N. & Marie-Sainte, S. L. Nada: New Arabic dataset for text classification. Int. J. Adv. Comput. Sci. Appl. 9 (2018).
Abuaiadah, D., El Sana, J. & Abusalah, W. On the impact of dataset characteristics on Arabic document classification. Int. J. Comput. Appl. 101 (2014).
Einea, O., Elnagar, A. & Al Debsi, R. Sanad: Single-label Arabic news articles dataset for automatic text categorization. Data Brief 25, 104076 (2019).
Almutairi, T. Arpd: The academic Arabic research papers dataset (corpus). https://doi.org/10.5281/zenodo.15547781 (2025).
Page, M. J. et al. Prisma 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ 372 (2021).
Geiger, C., Frosio, G. & Bulayenko, O. Text and data mining in the proposed copyright reform: Making the EU ready for an age of big data?. IIC Int. Rev. Intellect. Property Competit. Law 49, 814–844. https://doi.org/10.1007/s40319-018-0722-2 (2018).
Namly, D., Bouzoubaa, K., Tajmout, R. & Laadimi, A. On Arabic stop-words: A comprehensive list and a dedicated morphological analyzer. In Arabic Language Processing: From Theory to Practice, Communications in Computer and Information Science (Smaïli, K. ed.) . 149–163. https://doi.org/10.1007/978-3-030-32959-4_11 (Springer, 2019).
Alrefaie, M. T. Mohataher/Arabic-stop-words (2024). Original-date: 2016-05-27T13:49:47Z.
Aggarwal, C. C. & Zhai, C. A survey of text clustering algorithms. Min. Text Data 77–128 (2012).
Wahyuningrum, T. et al. Improving clustering method performance using k-means, mini batch k-means, birch and spectral. In 2021 4th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). 206–210 (IEEE, 2021).
Chen, Y. et al. Knn-block dbscan: Fast clustering for large-scale data. IEEE Trans. Syst. Man Cybern. Syst. 51, 3939–3953 (2019).
van der Merwe, D. & Engelbrecht, A. Data clustering using particle swarm optimization. In The 2003 Congress on Evolutionary Computation, 2003. CEC ’03. Vol. 1. 215–220. https://doi.org/10.1109/CEC.2003.1299577 (2003).
Rana, S., Jasola, S. & Kumar, R. A review on particle swarm optimization algorithms and their applications to data clustering. Artif. Intell. Rev. 35, 211–222 (2011).
Purushothaman, R., Rajagopalan, S. & Dhandapani, G. Hybridizing Gray Wolf Optimization (GWO) with Grasshopper Optimization Algorithm (GOA) for text feature selection and clustering. Appl. Soft Comput. 96, 106651. https://doi.org/10.1016/j.asoc.2020.106651 (2020).
Ahmadi, R., Ekbatanifard, G. & Bayat, P. A modified grey wolf optimizer based data clustering algorithm. Appl. Artif. Intell. 35, 63–79 (2021).
Ullmann, T., Hennig, C. & Boulesteix, A.-L. Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 12, e1444 (2022).
Chacón, J. E. & Rastrojo, A. I. Minimum adjusted rand index for two clusterings of a given size. Adv. Data Anal. Classif. 17, 125–133 (2023).
Zhao, Q. Cluster Validity in Clustering Methods. Ph.D. Thesis, Itä-Suomen yliopisto (2012).
Kowsari, K. et al. Text classification algorithms: A survey. Information 10, 150 (2019).
Wu, Q., Ye, Y., Zhang, H., Ng, M. K. & Ho, S.-S. Forestexter: An efficient random forest algorithm for imbalanced text categorization. Knowl. -Based Syst. 67, 105–116 (2014).
Goudjil, M., Koudil, M., Bedda, M. & Ghoggali, N. A novel active learning method using SVM for text classification. Int. J. Autom. Comput. 15, 290–298 (2018).
Xu, S. Bayesian naïve Bayes classifiers to text classification. J. Inf. Sci. 44, 48–59 (2018).
Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 17, 168–192 (2020).
Ferri, C., Hernández-Orallo, J. & Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognit. Lett. 30, 27–38 (2009).
Funding
This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, under grant no. (GPIP: 148-612-2024). The authors therefore acknowledge with thanks the DSR for its technical and financial support.
Author information
Contributions
All authors conceptualized this study, curated the data, and reviewed the manuscript. T.M. and R.M. developed the Methodology. T.M. collected the dataset, conducted the experiments, analyzed the results, and prepared the original draft.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.