Introduction

Social networks have become increasingly popular. In 2024, more than 5.16 billion people used social networks regularly worldwide, accounting for over 59.3% of the population1,2. This number is expected to reach six billion users by 20273. The expansion of these platforms, combined with the vast reach of the Internet, has created an environment where information is shared freely, allowing diverse viewpoints, including misinformation, to reach audiences at an unprecedented rate. This underscores the urgent need for robust mechanisms to verify the credibility of information in an increasingly interconnected digital landscape4,5. Promoting public health and ensuring reliable health information necessitates the use of advanced tools, such as algorithms for detecting fake news. These algorithms are crucial in combating misinformation campaigns, supporting evidence-based decision-making, mitigating public anxiety, enhancing health literacy, increasing public trust in healthcare institutions, and fostering effective communication in public health6,7. Accurate information is critically essential for safeguarding and improving health outcomes8. Machine learning (ML) is a powerful approach for detecting false news in healthcare due to its ability to perform real-time detection, feature learning, automatic detection, scalability, adaptation to changing strategies, and customization to the healthcare sector5,8.

According to the American Language Institute9, Arabic is the fifth most commonly spoken language in the world among Semitic languages. This is expected to lead to an increase in digital Arabic content on the Internet10,11. Several unique challenges are involved in classifying fake news in Arabic datasets. Some of these challenges are attributed to Arabic’s linguistic and cultural characteristics, while others reflect the more general difficulties associated with detecting fake news. Additionally, the number of large, publicly accessible, high-quality Arabic datasets for false news identification is significantly lower than that of English datasets10,12. The lack of labeled data makes training an accurate machine-learning model difficult. Arabic fake news data must be manually classified, a tedious and bias-prone process, particularly when distinguishing between truth and falsehood. The learning model and the knowledge base are key elements of machine learning systems that identify fake news in Arabic. The performance of a learning model is critically dependent on the volume and quality of the training data or knowledge base. Achieving high accuracy requires training on comprehensive datasets, which poses significant challenges in domains such as Arabic fake news detection due to the scarcity of labeled data10,13,14.

In the absence of adequate data, regularization approaches are crucial for enhancing model performance. These strategies typically involve data augmentation (DA) techniques or modifying the model’s settings. Data augmentation employs various methods to enlarge and enhance a dataset by making significant changes. The fundamental principle is to increase the quantity, diversity, and representational quality of training data15,16,17.

By incorporating augmented samples into the training dataset, data augmentation reduces overfitting, one of the most common issues encountered when training models on sparse data18,19. Furthermore, it generates additional instances of underrepresented classes, which helps address the class imbalance problem often encountered in fake news classification20. As a result, DA is a valuable technique for improving the accuracy of machine learning models in detecting fake news in Arabic. Several of these methods are based on the concept of paraphrasing, which can be achieved through the use of transformers, thesaurus searches, or translations10,18. A second category of proposed DA strategies involves introducing noise into sentence words using various techniques, such as word swapping, deletion, insertion, and substitution17,20. In recent years, large models (LMs) based on the Transformer architecture, such as GPT-221, BERT22, T523, large language models24,25, vision models26, and time series models27, which consist of billions or trillions of parameters and have been pre-trained on massive datasets, have offered substantial improvements in automation and diversity in data processing, thereby reducing the need for human intervention28,29. Data augmentation is crucial for the robustness and generalization of machine learning models, particularly for large language models (LLMs)24,30. Using DA, LLMs can excel in tasks such as processing low-resource languages or analyzing medical text by generating diverse and semantically consistent training examples. Additionally, by introducing controlled variations, DA helps prevent overfitting, enabling LLMs to learn more resilient representations and avoid relying too heavily on spurious patterns25,29. Growing interest in these capabilities has led to a proliferation of studies on industrial intelligence.
However, compared to approaches based on English data, there is a noticeable gap in research on using large language models with Arabic data31,32,33. Furthermore, since Arabic differs from other languages, methodologies developed for other languages may not be fully applicable to Arabic textual data34,35,36. Most proposed methods for Arabic rely on conventional data augmentation strategies, such as paraphrasing based on predefined rules and employing noising techniques10,11. However, little research has explored methods that significantly enhance the modeling of Arabic textual data through more precise augmentations12,13. Arabic transformers, such as AraBERT and AraGPT2, offer great potential for data augmentation. Additionally, LLMs using transformers can preserve the context of the text37,38. This study introduces a novel approach for enriching Arabic textual data by leveraging the capabilities of contemporary large language transformer-based models, specifically AraGPT-2. In the augmentation process, AraGPT-2 generates additional text samples, enlarging the dataset while maintaining high relevance to the original context and semantics. Several text evaluation criteria are applied to assess the quality of the generated sentences. These metrics assess the context, semantics, novelty, and variety of the augmented data using various similarity measures, including cosine and Jaccard distances. The key contributions of this study are as follows:

  1.

    An extensive analysis is conducted to evaluate the impact of various data augmentation techniques on detecting fake news in healthcare. The analysis focuses on the quality of the generated sentences using multiple text evaluation metrics, with an emphasis on label preservation, semantic coherence, diversity, and novelty. Key performance metrics such as F1-score, cosine similarity, BERTScore, and Jaccard scores provide insight into how the augmented data enhances the model’s ability to detect fake news.

  2.

    Similarity thresholds are systematically studied to assess their influence on the classification task, specifically by examining how varying lexical and semantic similarity thresholds affect the model’s ability to classify fake news accurately. These experiments offer insights into the trade-offs between focusing on word-level overlap and meaning-level alignment, as well as how these approaches impact the model’s robustness and generalization capabilities, particularly in detecting nuanced or paraphrased instances of fake news.

  3.

    We propose a novel ensemble data augmentation (DA) approach to overcome the weak performance of large language models, such as GPT, in augmenting Arabic tweets. Our method systematically explores and implements new ensemble augmentation techniques tailored to Arabic text generation. By utilizing GPT’s transformer combined with Synonymous Word Substitution techniques for augmentation, our approach ensures that the generated text maintains semantic integrity and linguistic coherence, which are crucial for handling the complex and morphologically rich Arabic language.

  4.

    Our model incorporates an innovative Arabic text Data augmentation component generated using generative pre-trained transformers (GPT-2). As a result, we can capture subtle linguistic variations and domain-specific vocabulary, enhancing the variety and richness of the training data in the context of Arabic healthcare.

The rest of this paper is organized as follows. Section “Related work” presents an overview of related work. The proposed solution is presented in Sect. “Proposed methodology for Arabic healthcare fake news detection with data augmentation and metric analysis”. Section “Experimental results and discussion” presents our experimental results and discussion. Implications for real-world applications are discussed in Sect. “Implications for real-world applications”. Finally, Sect. “Conclusion & future work” concludes the paper.

Related work

Data augmentation is an approach used in machine learning and computer vision to expand the size of a training dataset by generating new, synthetic data from the existing data39,40. Data augmentation techniques such as flipping and rotation are widely used in the computer vision domain41. The goal of data augmentation is to improve the performance and generalizability of machine learning models by exposing them to a diverse range of data variations39,41. In DA, the quantity of training data is increased by applying different transformations to the training data, thereby creating novel data samples. DA can also increase the variability of the data samples to enhance the model’s performance and prediction accuracy. Furthermore, DA can address the class imbalance problem in classification learning techniques42.

Text augmentation

Recently, DA has been used widely in natural language processing (NLP)43,44. Nevertheless, the discrete nature of natural language data makes it harder to transform inputs smoothly and quickly, because a sentence’s meaning may change when a single word is varied. As a result, data augmentation techniques were adopted later in NLP and remain more challenging45. Many DA techniques have been employed in various languages, with the majority applied to English. Some of these techniques augment the data by rephrasing it using thesauruses46, translation47, and transformers48. Another group of techniques augments the text by adding noise to sentences, such as deleting49, inserting50, and swapping words51.

Wei and Zou49 proposed EDA (Easy Data Augmentation), a set of techniques for enhancing the performance of text classification tasks. EDA exploits text editing operations for data augmentation through four straightforward processes: synonym replacement, random insertion, random swap, and random deletion. In52, the authors propose a Prompt-based Data Augmentation model (PromDA), which primarily focuses on data augmentation for limited resources in Natural Language Understanding (NLU) tasks. PromDA trains only a set of trainable vectors within frozen Pre-trained Language Models (PLMs). This approach preserves the quality of the synthetic data generated and eliminates the need for human labor to collect unlabeled data within the domain. Furthermore, PromDA generates synthetic data from two distinct perspectives and employs NLU models to filter out lower-quality data.
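The four EDA operations can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the mini-thesaurus is a hypothetical stand-in (EDA uses WordNet), and real EDA applies each operation with tunable rates.

```python
import random

def eda(words, rng):
    """One pass of the four EDA-style operations on a token list (toy sketch)."""
    # Hypothetical mini-thesaurus; the original EDA looks up synonyms in WordNet.
    synonyms = {"good": ["fine"], "fast": ["quick"]}
    out = list(words)
    # 1. Synonym replacement: swap the first word that has a synonym.
    for i, w in enumerate(out):
        if w in synonyms:
            out[i] = rng.choice(synonyms[w])
            break
    # 2. Random insertion: insert a synonym of a random word at a random position.
    cands = [w for w in words if w in synonyms]
    if cands:
        out.insert(rng.randrange(len(out) + 1), rng.choice(synonyms[rng.choice(cands)]))
    # 3. Random swap: exchange two randomly chosen positions.
    if len(out) > 1:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    # 4. Random deletion: drop one random word.
    if len(out) > 1:
        del out[rng.randrange(len(out))]
    return out

aug = eda(["good", "fast", "service"], random.Random(0))
```

Because one word is inserted and one deleted, the sketch returns a list of the same length as the input but with different content and order.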

Back-translation is one of the most widely used data augmentation techniques in NLP, as it is simple to implement and can generate high-quality data. The study of text translation has advanced quickly in recent years. Multiple technology companies, such as Google, have launched translation interfaces. In the Back-translation technique, the data is translated from one language into another and then returned to the original language to generate novel data53.
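The back-translation loop described above can be sketched with a pluggable translation function. The `translate` callable is an assumption standing in for a real service (e.g. a wrapper around a commercial translation API); here it is stubbed with an identity function so the sketch stays self-contained.

```python
def back_translate(text, translate):
    """Back-translation: translate Arabic to a pivot language and back.

    `translate` is any callable taking (text, src, dest); in practice it would
    wrap a translation API, which usually introduces small rephrasings.
    """
    pivot = translate(text, src="ar", dest="en")
    return translate(pivot, src="en", dest="ar")

# Identity stub standing in for an actual translation service.
def identity_translate(text, src, dest):
    return text

augmented = back_translate("الطقس جميل اليوم", identity_translate)
```

With a real translator the round trip typically yields a paraphrase rather than the identical sentence; the stub simply demonstrates the control flow.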

In54, the authors study various transformer-based pre-trained models for conditional data augmentation, including auto-regressive models (GPT-2), auto-encoder models (BERT), and seq2seq models (BART). Furthermore, they measure how the data augmentation methods using several pre-trained models differ in data diversity. They also examine how using these methods can preserve the labels of the data.

In55, the authors present AugGPT, a ChatGPT-based text data augmentation approach. AugGPT leverages ChatGPT’s capabilities to rephrase sentences in the training data into multiple linguistically distinct yet conceptually comparable ones. After that, AugGPT uses the generated data in a few-shot text classification. They compare the embedding similarity scores between the generated samples and the actual ones to validate the quality and effectiveness of the augmentation method for generating semantically similar variations. However, while AugGPT may be a robust tool for text data augmentation, it can produce incorrect augmentation results due to ChatGPT’s limited domain knowledge in some areas.

Unlike previous work that measures the quality of generated data based only on classification performance, our work measures the quality of generated data through a multi-metric analysis that includes novelty, label preservation, semantic similarity, and diversity. Table 1 compares various text augmentation methods based on their ability to preserve context, diversity, novelty, and the label of the augmented data.

Table 1 Comparison of Text Data augmentation techniques.

Text augmentation for Arabic language

Despite the reach of the Arabic language, considered the fifth most commonly spoken language in the world among Semitic languages11, few studies have addressed augmenting Arabic data compared with English42,58,59,60,61. Moreover, because Arabic has characteristics that distinguish it from other languages, not all techniques used for other languages can be applied to Arabic textual data62.

Antoun et al.61 developed ARAGPT2, the first Arabic language generation model based on the transformer architecture. An enormous collection of filtered, publicly available Arabic corpora was used to train the model. The perplexity metric, which computes how effectively a probability model predicts a sample, was used to evaluate the model. The results confirm that ARAGPT2 can generate coherent, grammatically sound, and syntactically correct Arabic text of high quality. In58, the authors presented a framework that can augment Arabic phrases and is considered the first to apply text augmentation to the Arabic language. First, they labeled the Arabic sentences using human annotators for use in sentiment analysis. The framework generates new phrases from seed sentences with their correct labels by exploiting the rich morphology of Arabic, synonym lists, syntactic and grammatical rules, and negation rules. Words are replaced with their corresponding synonyms using Arabic WordNet63, and these rules preserve the input sentence labels.

In59, the authors present a data augmentation method for enhancing the results of Named Entity Recognition (NER), particularly for code-switching (CS) in Arabic data. This approach was based on three techniques: back-translation, a modified version of the Easy data augmentation methodology, and word embedding substitution.

Dania et al.42 proposed a novel Arabic data augmentation (DA) method that utilizes AraGPT-2, a newly developed powerful modeling technique, to enhance the augmentation process. The augmented phrases are evaluated using Euclidean, cosine, and Jaccard distances for context, semantics, variety, and novelty. Following this, the classification performance of the enhanced Arabic dataset is evaluated using the AraBERT transformer on sentiment classification tasks.

In19, the authors present a novel two-stage framework for detecting misinformation in Arabic text. The first stage determines the most effective feature representation before it is entered into the machine learning model. Different representations of tweet content are studied, including N-grams, content-based features, and source-based features. The second stage explores the impact of data augmentation using the back-translation technique applied to the original training data.

Our proposed method differs from previous work, which focuses exclusively on classification performance: we evaluate the quality of the generated data using a multi-metric analysis covering novelty, label preservation, semantic similarity, and diversity. Table 2 compares various Arabic text augmentation methods based on preserving the context, diversity, novelty, and label of the augmented data.

Table 2 Comparison of Arabic Text Data Augmentation Techniques.

Text augmentation for fake news

In an era of rapidly growing information, news can be systematically manipulated to spread social influence and generate misinformation, including fake news. Dishonestly propagating fake news has destructive effects, such as leading to poor decisions64. Hua et al.64 proposed TTEC, a multimodal machine-learning framework that combines contrastive learning with text translation, using BERT-based back-translation of the text together with the entire image. In the TTEC approach, a multi-head attention method extracts relevant features from multimodal data that combines text and images64. To address the minority-class imbalance problem, the AugFake-BERT model was suggested56. In the AugFake-BERT model, the authors create an augmented dataset consisting of artificially generated fake data and present a text augmentation technique based on the Bidirectional Encoder Representations from Transformers (BERT) language model. In57, they explore how text data augmentation techniques can be used to identify stances and detect fake news. First, they study how the performance of standard classification algorithms is affected by different text augmentation methods and identify the optimal pairs of classification algorithms and augmentation methods that together achieve the highest accuracy. Second, they suggest an augmentation-based ensemble learning strategy that combines bagging and stacking techniques. This method uses text augmentation to enhance the accuracy and diversity of base learners, which in turn improves the ensemble’s predictive performance.

Comparison with related literature

The lack of data challenges modern techniques for identifying fake news about Arabic healthcare. Table 3 summarizes and categorizes research gaps in Arabic healthcare fake news detection, along with flaws in current models reported in recent publications. In contrast to previous related work, the proposed model introduces an Arabic text enrichment procedure that utilizes large Arabic models to mitigate this problem. Our study improves the precision and dependability of healthcare data analysis by methodically addressing these gaps.

Table 3 Research gaps in existing models.

Proposed methodology for Arabic healthcare fake news detection with data augmentation and metric analysis

Fig. 1
figure 1

Working steps of the proposed Arabic Healthcare Fake News Detection.

In this section, a fake news detection system is proposed for the healthcare industry that can adapt to evolving trends in misinformation. The proposed classification approach is shown in Fig. 1. Preparing the data is the first step in the augmentation process. Raw text data is cleaned and standardized during this phase to ensure consistency. The process involves removing extraneous characters from Arabic texts, including punctuation and special symbols, as well as managing stop words and diacritical marks. Once preprocessing is completed, various text augmentation techniques are applied, which affect the preservation of class labels, semantics, diversity, and novelty in the generated sentences. Extensive experiments are conducted to identify the most accurate combinations of augmentation techniques and classification algorithms. Evaluation metrics—including Type-Token Ratio (TTR)65, cosine similarity66, BERTScore67, and Jaccard similarity scores68—are assessed. Subsequently, the system generates a new dataset containing the augmented Arabic text. It then trains a model on this augmented dataset to classify news as real or fake. The model’s performance is evaluated using various metrics, and the system is implemented for practical use.

The suggested model incorporates context analysis to assess both the reliability of sources and the semantic content of Arabic-generated text. To integrate sentence context into the proposed model, changes in label semantics, diversity, and novelty—key properties of fake and authentic healthcare news stories—are tracked over time. This work enhances the performance of fake news detection systems in Arabic. It contributes to the broader field of Arabic natural language processing by providing a robust data augmentation and model training framework.

Dataset pre-processing and transformation

As the Arabic language has considerable morphological richness - a single Arabic word may have many meanings and forms - a thorough understanding of the language is essential. Our suggested technique has been tested and applied to a fake news dataset in Arabic, as shown in Table 4. The dataset used and preprocessing tasks are described below:

Dataset: This research used Arabic healthcare data on infectious diseases collected from September 2019 onward, focusing on COVID-19 tweets retrieved using three primary keywords related to the infection69. In the original dataset, approximately six million tweets were collected. A Twitter API was used to collect tweets on a weekly basis. After collecting the data, the authors manually eliminated advertisements, spam, and retweets. Furthermore, Python scripts filtered out non-Arabic words, digits, emoticons, hashtags, URLs, and mentions. According to the paper’s primary author, each tweet was categorized into one of five categories69: Academic, Media, Healthcare Professional, Government, and Public. The primary focus of this study is to classify and evaluate tweets related to healthcare, after normalizing and tokenizing the text and removing Arabic stopwords. The Saudi Ministry of Health’s official list70, updated daily, served as the foundation for labeling tweets as true, false, or irrelevant. The labels for the tweets, representing accurate information, unrelated material, and incorrect information, are 1, −1, and 0, respectively. Only 2,365 tweets were manually annotated, while the remaining tweets were annotated automatically. To obtain more robust results, we conducted our experiments using only the manually annotated subset, which consists of 2,365 tagged tweets, 30,637 words, and 7,798 unique words. Table 4 presents the statistics of the original dataset.

Table 4 The statistics of original dataset.

Data Preprocessing: This step cleans and standardizes raw text data to ensure consistency and accuracy. The process involves removing extraneous characters from Arabic texts, including punctuation and special symbols, as well as managing stop words and diacritical marks. The text is then tokenized, and lemmatization and stemming reduce words to their base forms.

Data Transformation: To convert raw text into meaningful numerical representations, TF-IDF (Term Frequency - Inverse Document Frequency)71 is applied with the Random Forest (RF) algorithm72. TF-IDF is a powerful feature extraction technique, especially for traditional machine learning models such as RF. However, our AraBERT model utilizes word embeddings (WE) to capture semantic information and relationships between words73.
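The TF-IDF-plus-RF pathway described above can be sketched with scikit-learn. The toy English corpus and label convention (1 = real, 0 = fake) are illustrative placeholders for the preprocessed Arabic tweets; this is a minimal sketch, not the paper's actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for preprocessed Arabic tweets (1 = real, 0 = fake).
texts = [
    "honey cures covid instantly",
    "vaccines are tested in clinical trials",
    "garlic kills the virus overnight",
    "washing hands reduces infection risk",
]
labels = [0, 1, 0, 1]

# TF-IDF turns each tweet into a sparse term-weight vector; RF classifies the vectors.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(texts, labels)
preds = model.predict(["vaccines reduce infection risk"])
```

In the full system the same vectorizer is fit on the training split only, so test tweets are transformed with training-set vocabulary and IDF weights.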

Data augmentation and measuring the quality of data

This stage generates new Arabic healthcare tweets from the existing ones. The augmented data is then filtered based on similarity to improve its quality (as shown in Fig. 1). Different data augmentation techniques are applied. At this stage, the quality of the generated data is measured by several criteria, including label preservation, semantics, novelty, and diversity, for the different data augmentation (DA) techniques. These criteria are discussed as follows.

Label Preservation: A method to ensure that the augmented instance retains the same class label as the original data point, which is crucial for effective model training. Failure to preserve the label can introduce noise, potentially misleading the model during training. Textual DA techniques, such as back-translation and synonym substitution, help maintain label integrity by ensuring the modified sentence remains in its correct category.

Preserving Semantics: Augmented data should introduce variety while preserving the original content’s meaning and context. To achieve this, back-translation and transformer-based models (e.g., BART, T5) generate rephrased sentences while preserving the original intent.

Novelty: Novelty refers to the degree to which augmented data differs from the original instance while remaining valid. Higher novelty exposes the model to a wider range of patterns, improving its adaptability.

Diversity: Diversity represents the extent of variation introduced through augmentation across different samples. Ensuring diversity prevents overfitting and enhances generalization across unseen data distributions.

Fig. 2
figure 2

Augmentation workflow for Arabic healthcare.

Algorithm 1
figure a

Arabic Text Data Generation

The data augmentation process, as shown in Algorithm 1 and Fig. 2, begins by separating the original dataset into an 80% training set and a 20% test set. The augmentation process removes duplicate tweets to ensure data quality and prevent overfitting.
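The deduplication and 80/20 split above can be sketched as follows. The function name and seed are illustrative, and a production pipeline would typically stratify the split by label; this sketch only shows the order of operations (deduplicate first, then split).

```python
import random

def dedup_and_split(tweets, labels, test_frac=0.2, seed=42):
    """Remove duplicate tweets, then split into train/test (80/20 by default)."""
    seen, pairs = set(), []
    for t, y in zip(tweets, labels):
        if t not in seen:          # keep only the first occurrence of each tweet
            seen.add(t)
            pairs.append((t, y))
    random.Random(seed).shuffle(pairs)
    n_test = max(1, int(len(pairs) * test_frac))
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test

train, test = dedup_and_split(["a", "b", "a", "c", "d", "e"], [1, 0, 1, 0, 1, 0])
```

Only the training portion is later augmented; the test portion is held out untouched so evaluation is not biased by the augmentation procedure.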

Step 1: data augmentation

This study used data augmentation for Arabic to increase the variability and diversity of the training dataset. Data augmentation involves modifying the initial training data to obtain fresh, synthetic samples. For example, in synonym replacement, the sentence “The weather today is beautiful.” could be augmented by replacing “beautiful” with its synonym “wonderful,” resulting in “The weather today is wonderful.” The test set was selected independently of the enhanced training data to assess the model’s performance objectively. The study conducted trials using the same test set derived from the original dataset. Consequently, the evaluation is free from bias caused by the data augmentation procedure and accurately represents the impact of augmentation strategies on model performance.

A rigorous evaluation of different metrics (including Algorithm 2) is implemented to determine the effectiveness of DA techniques in increasing data diversity, preserving labels, and consequently improving model performance. Moreover, the current study examined optimal similarity thresholds to balance the trade-off between diversity and relevance in the augmented data.

In this paper, we present five DA techniques that are summarized as follows:

  1.

    Synonym (Word substitution): Word substitution replaces words in a sentence with their synonyms while maintaining the meaning of the original sentence. This technique can generate new variations of the text to augment the data. We used WordAntonym Substitution, WordNet, and TF-IDF in our proposed approach. Despite their topical similarity, the original and generated texts often diverge significantly in factual alignment, mainly when the antonym replacement technique is used. Sometimes, the generated text conveys the opposite meaning while retaining the same class label; as a result, the model frequently encounters unexpected words and misleading examples. To ensure consistency between the generated and original text, the system reverses the class label. For example, for the original text “Asian workers, most of them dirty, spit everywhere, we do not want them to,” the corresponding generated text is “Asian workers, fewest of them dirty, spit everywhere, we do not want them to,” where the word “most” is replaced by “fewest.” In such cases, the class label is reversed from ’fake’ to ’real’ or vice versa so that the factual meaning remains unchanged.

  2.

    Backtranslation: a text is translated into another language and then returned to the original language, introducing subtle yet significant changes. The back translation process enables the model to view new text versions that retain the original message but differ in structure or word choice. This method is beneficial in exposing models to a variety of linguistic forms that may appear in fake news. In our proposed approach, we used the Google backtranslation API.

  3.

    Contextual using Transformers: Using a transformer, models can manage long-range dependencies and simultaneously calculate the relationships between every word in a phrase. Pre-trained transformers, such as BERT and RoBERTa, demonstrate high effectiveness at generalizing across tasks, as pretraining equips these models with a profound understanding of language. Our approach utilizes GPT-2, WordEmbeddings RoBERTa, WordEmbeddings DistilBERT, and WordEmbeddings BERTBase. Several reasons motivated the selection of GPT-2 for this investigation. First, this study focuses on Arabic data augmentation to create representative and diverse datasets, and GPT-2 provides sufficient capacity to meet task requirements without the extra complexity and resources of more sophisticated models, such as GPT-4. In this way, performance and methodology can be evaluated from an efficient starting point, providing a valuable baseline for comparisons with larger and more complex models in the future. Furthermore, practical computational and resource efficiency considerations were taken into account when selecting GPT-2.

  4.

    Pipeline: In the pipeline technique, a group of augmentation functions is selected to be applied to the original tweet sequentially or randomly. In our technique, three functions are used. The first function is RandomWordAug, which randomly swaps, crops, or deletes words. The second function is AntonymAug, which substitutes a word with an opposite meaning based on WordNet. The last function is ContextualWordEmbsAug, which identifies the most suitable word for substitution using the BERT or RoBERTa model. The pipeline sequential technique applies the three functions sequentially, where the output of one function is fed as input to the next. The final augmented tweet is the output of the last function in the sequence. However, in the Pipeline_random technique, the functions are applied in a random order.

  5.

    Ensemble: A novel ensemble DA approach is proposed to enhance the performance of GPT large language models for augmenting Arabic tweets. Our method systematically investigates and implements new ensemble augmentation techniques for generating Arabic text. Combining GPT’s transformer with Synonymous Word Substitution techniques in the augmentation process can generate text that maintains semantic integrity and linguistic coherence. We combine WordAntonym and WordNet in our ensemble approach, as described in Fig. 3a. Moreover, we combine GPT-2 and WordAntonym, as displayed in Fig. 3b. Additionally, we combine GPT-2, WordAntonym, and WordNet, as illustrated in Fig. 3c. Each combination merges the instances generated by its constituent DA techniques into one dataset without duplication, which increases the variation of terms distributed in the dataset. Three task-specific transformation functions are combined:

Fig. 3
figure 3

Ensemble Data Augmentation Techniques.
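The antonym-substitution label-flip rule from the synonym technique in Step 1 can be illustrated with a minimal sketch. The `ANTONYMS` lexicon and function names are hypothetical stand-ins; a real system would draw antonyms from WordNet or an Arabic lexical resource.

```python
# Hypothetical mini antonym lexicon standing in for WordNet / an Arabic resource.
ANTONYMS = {"most": "fewest", "increase": "decrease", "safe": "dangerous"}

def antonym_augment(text, label):
    """Replace the first word that has a known antonym and flip the class label
    (0 = fake, 1 = real) so the new text/label pair stays factually consistent."""
    words = text.split()
    for i, w in enumerate(words):
        if w in ANTONYMS:
            words[i] = ANTONYMS[w]
            return " ".join(words), 1 - label  # antonym reverses the claim: flip the label
    return text, label                          # no antonym found: return unchanged

aug_text, aug_label = antonym_augment("most workers are safe", 0)
```

Flipping the label is what keeps the augmented pair usable: the generated sentence asserts the opposite claim, so its truth value is the opposite of the original's.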
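The ensemble combinations described above merge the instances produced by each constituent DA technique into one dataset without duplication. A minimal sketch of that merge, with toy lambda augmenters standing in for GPT-2, WordAntonym, and WordNet:

```python
def ensemble_augment(original, augmenters):
    """Union the outputs of several DA techniques into one dataset,
    dropping duplicates while preserving first-seen order."""
    combined, seen = [], set()
    for augment in augmenters:
        for sample in augment(original):
            if sample not in seen:
                seen.add(sample)
                combined.append(sample)
    return combined

# Toy augmenters whose outputs overlap on "x2"; the duplicate is kept only once.
data = ensemble_augment("x", [lambda t: [t + "1", t + "2"],
                              lambda t: [t + "2", t + "3"]])
```

Deduplicating at merge time is what lets the ensemble add term variety without inflating the dataset with repeated samples.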

Step 2: filtering augmented data based on similarity measurement

As part of the refinement process, the augmented data is filtered based on Jaccard, BERTScore, and cosine similarity. Tweets with a similarity score of 0.5 or higher are retained, resulting in a filtered dataset. We then perform a grid search across BERTScore and cosine similarity thresholds over the values {0.6, 0.7, 0.8, 0.9}, selecting the threshold that produces the best model performance. This filtered dataset is then used to retrain the model, and the results are re-evaluated using the original dataset. Using this multi-step augmentation and filtering process, one can assess the impact of data augmentation on model performance while ensuring semantic relevance.

Algorithm 2
figure b

Calculate Novelty Scores

Step 3: data quality measurement

The last step in the augmentation process is evaluating the quality of the generated sentences. In our framework, several text evaluation criteria are used to measure the context, semantics, novelty, and variety of enhanced data. The semantic alignment between the generated and original sentences is measured using BERTScore and cosine similarity to ensure the enhanced data maintains its original meaning. To encourage heterogeneity within the dataset, the Type-Token Ratio and Rouge similarity74 are used to evaluate the diversity of the created information. Jaccard similarity ensures the novelty of the augmented text at the word level. This combination optimizes the augmentation process and enhances the model’s ability to identify false news.
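
The word-level novelty criterion can be illustrated with a short sketch, assuming whitespace tokenization (the framework's Algorithm 2 computes these scores over the full augmented dataset):

```python
def jaccard_novelty(original, augmented):
    """Word-level novelty: 1 minus the Jaccard overlap between the
    original and augmented token sets (1.0 = entirely new words)."""
    sa, sb = set(original.split()), set(augmented.split())
    union = sa | sb
    return 1.0 - (len(sa & sb) / len(union)) if union else 0.0

def average_novelty(pairs):
    # Mean novelty over all (original, augmented) pairs of a DA technique.
    return sum(jaccard_novelty(o, g) for o, g in pairs) / len(pairs)
```

A technique whose average novelty is near zero is merely duplicating the training data, while a score near one suggests it may have drifted away from the original meaning.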

Classification and model evaluation

The primary goal of this study is to retrain the classification models on the new augmented datasets (the output of stage 2) and evaluate their effectiveness in detecting false news in Arabic text. We use hold-out validation, splitting the dataset into 80% training and 20% testing sets. In Experiment A, we additionally use 5-fold cross-validation to check for model overfitting.

First, the models are trained on the original training data together with the newly augmented Arabic tweets, using AraBERT and RF to classify the Arabic text data.

After training, each model is evaluated on the test set, producing fake-news classifications for the test tweets. Model performance is measured with accuracy, precision, recall, and F1-score (the macro-averaged value is used throughout this paper). At the end of our framework, we output a recommendation for a data augmentation technique based on the data quality measurements the framework provides.
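
Since the macro average weighs both classes equally (important here, where the fake class may be the minority), a short sketch of macro-F1, assuming whitespace-free label lists, is:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 computed independently, then
    averaged, so the minority class weighs as much as the majority."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

In practice a library implementation (e.g. scikit-learn's `f1_score` with `average="macro"`) would be used; the sketch just makes the averaging explicit.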

Experimental results and discussion

Augmentation techniques are applied to generate new tweets based on the training data. Generated tweets retain the same labels as their corresponding original tweets. To establish a baseline, we evaluate the model’s performance on the original training set using the unaltered test set. The augmented tweets are then added to the original training data, resulting in a non-filtered dataset. The model is retrained on this augmented dataset, and the results are compared with those on the original dataset. The proposed implementation integrates the NLP augmentation library75 for text data processing and augmentation. The paper presents several experiments, each with a different objective, as follows:

  (a)

    Experiment A: Analyzing the effects of augmentation on the classification performance of fake news. RF and AraBERT classifiers are applied to the original and augmented datasets.

  (b)

    Experiment B: Determining and establishing the best similarity threshold between the original Arabic tweet and the augmented tweet. The augmented dataset is filtered according to threshold values for the cosine similarity and BERTScore measures. To evaluate the effect of contextual augmentation, we compare the classification results obtained with the cosine similarity and BERTScore measures.

  (c)

    Experiment C: Evaluating the data quality of the suggested augmentation methods in terms of label preservation, novelty, diversity, and semantics. Additionally, we determine whether the proposed methods consider context during the augmentation process.

  (d)

    Experiment D: Evaluating the effect of a proposed ensemble augmentation approach on Arabic Healthcare classification performance.

  (e)

    Comparison between our proposed approach and the previous approaches that use the same dataset.

Experiment A: assessing the effectiveness of the augmentation process in identifying fake Arabic news

This experiment evaluates two classification models, RF and AraBERT, on augmented datasets produced by the various augmentation techniques. The similarity score is ignored when merging all newly created data with the original dataset. To ensure a fair comparison between DA techniques, the same number of augmented tweets (1,203) was generated and used in this experiment.

Table 5 displays the categorization results for the original dataset, which serve as our baseline results. In Tables 6 and 7, the results of the RF and AraBERT classifiers (using original \(+\) new augmented tweets) are presented. Eight augmentation techniques (WordAntonym, WordNet, TFIDF, WordEmbeddings_BERTBase, WordEmbeddings_DistilBERT, WordEmbeddings_RoBERTa, Pipeline_random, and Pipeline_sequential) are applied to the original dataset to provide a more comprehensive analysis. The augmentation process adds the same number of newly generated tweets for each class to the original dataset, keeping the dataset balanced.
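
The per-class balancing step can be sketched as below; the `per_class` cap is an assumption made explicit for illustration.

```python
def balanced_augment(original, generated, per_class):
    """Add at most `per_class` generated tweets for each label so the
    augmented dataset stays balanced.  `original` and `generated` are
    lists of (text, label) pairs."""
    augmented = list(original)
    counts = {}
    for text, label in generated:
        if counts.get(label, 0) < per_class:
            augmented.append((text, label))
            counts[label] = counts.get(label, 0) + 1
    return augmented
```

This guards against an augmentation technique that happens to generate more tweets for one class skewing the class distribution of the training set.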

Table 6 shows the RF results for the Not-Filtered Dataset. RF’s performance improves significantly with augmentation, especially with the WordAntonym method, surpassing the results on the original dataset. WordAntonym achieved the highest performance across all metrics, with an accuracy of 92.45%, precision of 93%, recall of 92%, and F-score of 92%. The TF-IDF method performed worst among the evaluated methods, with an accuracy of 87.17% and precision, recall, and F-score of 87%. Table 7 displays the AraBERT results for the Not-Filtered Dataset; the WordNet DA method achieves higher accuracy than the other methods.

For both RF and AraBERT classifiers, the performance of the augmented dataset is better than that of the original dataset, suggesting that augmentation may be a valuable means of improving the classification performance.

Table 5 Results of Original Dataset.
Table 6 RF Results of Not-Filtered Dataset (with no similarity).
Table 7 AraBERT Results of Not-Filtered Dataset (with no similarity).
Table 8 Performance metrics for 5-fold cross-validation of RF classifier.
Fig. 4
figure 4

AraBERT Performance Evaluation using Augmented Dataset.

According to Tables 6 and 7, the AraBERT classifier outperforms the RF model for all enhancement strategies. The AraBERT model outperforms the RF model by 2.68% for WordAntonym and 5.62% for WordNet. Additionally, while both models receive the same number of tweets through augmentation, AraBERT uses the augmented data more effectively, achieving a higher score.

Additional experiments were conducted for the RF and AraBERT models to detect potential model overfitting. The RF model was retrained using 5-fold cross-validation on both the original dataset and an augmented dataset containing generated tweets, as presented in Table 8. For the AraBERT model, we ran 15 epochs on the original dataset after adding the augmented tweets. Figure 4 illustrates the training and validation accuracy over the 15 epochs.

Based on the performance across all folds, as illustrated in Table 8 and Fig. 4, the augmentation technique improves model generalization without introducing biases or redundancies. The results indicate that the RF model does not exhibit overfitting during training, and the test results are robust and reliable. Additionally, the training and validation accuracy of the AraBERT model improved throughout the epochs, indicating that the model continued to perform well after the inclusion of augmented tweets.

Overall, the replacement techniques (WordAntonym, WordNet) perform better than the others. This is because they do not alter the context of the input tweet; instead, they replace individual words with antonyms or synonyms drawn from the WordNet dictionary. In contrast, contextual word-embedding techniques achieve lower performance in the classification task. These techniques rely on pre-trained deep-learning models to replace words in the original tweets based on contextual understanding; however, since most tweets are short, they do not provide sufficient context for these models. Their performance also depends on the size and domain of the corpus used for pre-training. As a result, these models often generate tweets with different contexts and lower similarity to the originals. Pipeline-based augmentation was also tested but proved significantly less effective than the alternatives, with a maximum accuracy of 82.61%; it was therefore excluded from subsequent analyses in favor of the techniques that yielded better results.

Experiment B: the impact of similarity threshold on classification performance

In this section, a systematic analysis is conducted to assess the impact of filtering on classification outcomes and the model’s overall effectiveness. Preliminary research suggests that filtering generated data with a higher similarity threshold could minimize noise and thus improve precision at the expense of recall. Conversely, low similarity requirements allow for more sentence changes, which may enhance recall but compromise accuracy as duplicate or irrelevant data is added. A balance must therefore be struck between these measures, highlighting the trade-offs in choosing a similarity threshold and its impact on the models’ classification performance. Based on the results of Experiment A, we selected the two best-performing DA techniques, Word_Antonym and WordNet, to investigate the impact of similarity filtering on augmentation performance. Table 9 displays the RF results on the BERT-similarity-filtered dataset of tweets generated with the Word_Antonym method. As shown in Table 9, the threshold range (0.4–1.0) outperforms the others, as it retains the most tweets. Table 10 presents the corresponding RF results for the WordNet technique; here too, the threshold range (0.4–1.0) outperforms the others and retains the highest number of tweets.

Table 11 presents the results of the AraBERT model on the BERT-similarity-filtered dataset of tweets generated using the \(Word\_Antonym\) method. AraBERT achieves its best accuracy of 97.22% with \(Word\_Antonym\) augmentation in the range \(0.5 \le \text {Score} \le 1.0\), making this the ideal range for accurate predictions; it also retains the most tweets (1,125). With high similarity thresholds (greater than 0.8), AraBERT’s accuracy falls to 87.85%, indicating diminishing returns. Additionally, Table 12 presents the AraBERT results on the BERT-similarity-filtered dataset of tweets generated using the WordNet method; the threshold range (0.4–1.0) again surpasses the others and retains more tweets than the other ranges.

Table 9 RF Results of Filtered-based BERT Similarity Dataset (Word_Antonym).
Table 10 RF Results of Filtered Dataset Based on BERT-Similarity Score (WordNet).
Table 11 AraBERT Results of Filtered-based BERT similarity Dataset (Word_Antonym).
Table 12 AraBERT Results of Filtered Dataset Based on BERT-Similarity Score (WordNet).

From Tables 9, 10, 11, and 12, we observe that a balance between similarity and growth (the number of generated tweets retained) is associated with higher classification performance. Classification at the similarity ranges (0.4–1.0) and (0.5–1.0) outperforms the others because the newly generated sentences are not excessively similar to the original corpus within the same class label. As shown in Experiments A and B, the RF model improved from 92.45% without similarity filtering to 93.58% with it. We conclude that tweet volume and accuracy together provide a strategic direction for refining models to optimize performance and data utility; in general, similarity-based filtering can benefit both the RF and AraBERT models.

Experiment C: evaluating the data quality of augmentation techniques in terms of preserving label, semantics, novelty, and diversity

Through experiment C, our goal was to evaluate the quality of sentences generated by the different data augmentation techniques. This evaluation employs a combination of metrics, including the ability to preserve class labels, semantics, diversity, and novelty.

Table 13 RF Results of Filtered Dataset based on Cosine Similarity Score (Word_Antonym).
Table 14 RF Results of Filtered Dataset based on Cosine-Similarity Score (WordNet).
Fig. 5
figure 5

BERTscore Similarity Score Distribution.

To assess the ability of a DA technique to preserve semantics, we compare the classification results based on cosine similarity (text match) with those based on BERTScore similarity (semantic match). The filtering process is therefore also conducted based on cosine similarity, as shown in Tables 13 and 14. Table 13 presents the classification results when filtering the augmented sentences produced by Word_Antonym: with increasing similarity thresholds, classification performance steadily improves, peaking around the 0.6–1.0 range (accuracy: 81.13%, F1-score: 81%). Table 14 shows the classification results using augmented data produced by WordNet and filtered by cosine-similarity thresholds. The mid-range similarity criterion (0.6–1.0) yields 80% accuracy and F1-score, indicating that augmented sentences in this range are most useful to the model because they remain semantically similar to the originals. Performance declines slightly at the higher threshold (0.8–1.0) (accuracy: 78.09%, F1: 78%), as an excessive constraint restricts diversity and erodes the gains from augmentation.

Based on the results of both tables, the semantic quality control using cosine similarity enhances the efficacy of data augmentation. However, there is a trade-off between the quantity of the dataset and semantic purity. As a result of the ideal augmentation, which occurs at a moderate similarity threshold (0.6–1.0), the model achieves both performance and generalization. According to these findings, sentence-level augmentation necessitates the careful adjustment of similarity filters to strike a balance between the quantity and quality of the data.

A comparative analysis of Tables 9 and 10 against Tables 13 and 14 demonstrates that BERT similarity outperforms cosine similarity in preserving semantic integrity during augmentation. WordNet-based augmentation accuracy declined from 90.94% to 80% under cosine-similarity filtering, while Word_Antonym-based performance dropped from 93.58% to 81.13%, exhibiting BERT’s superior effectiveness in maintaining semantics. This superiority of BERT similarity leads to improved classification performance.

Figure 5 illustrates the similarity between the original and augmented tweets based on BERTScore. The x-axis of the histogram shows the BERTScore values, i.e., the semantic similarity between the original and augmented text; the y-axis shows the frequency of each BERTScore value, representing how often each similarity level occurs in the “fake” class samples from the original and augmented datasets. Using BERTScore (Fig. 5a through 5d), we demonstrate how context-aware evaluation provides a more nuanced understanding of text similarity. This approach better captures the semantic relationships within the text, thereby improving the classification model’s overall performance.

Fig. 6
figure 6

Comparison of semantic (BERT) Precision, Recall, and F1-Score for Different Augmentation Techniques.

To assess the ability of the DA technique to provide novel data, the average novelty score is calculated for different augmented datasets using the Jaccard similarity metric, as shown in Fig. 6. In text augmentation, it is crucial to strike a balance between semantic similarity and semantic novelty. Higher semantic novelty scores (closer to 1) generally reflect a more significant semantic difference between the original and augmented text. The threshold for what constitutes “significant” novelty varies depending on the specific task and desired level of augmentation.

Fig. 7
figure 7

Novelty of Data Generated by DA techniques.

Lexical and semantic novelty scores measure variation in word choice and meaning relative to the original text. The balance between novelty and relevance is maintained to eliminate redundant data, ensuring that the augmented text adds functional variants without altering its meaning or context. Figure 7 summarizes the scores obtained for these metrics: some DA techniques, such as TF-IDF and WordNet, introduce new words into the original data, unlike WordAntonym and BackTranslate, which introduce fewer new words but preserve the original context and meaning.

Semantic similarity assesses how closely related two pieces of text are in meaning, whereas semantic novelty measures the degree to which the meaning of a piece of text differs from another. The Jaccard similarity metric assesses the degree of uniqueness introduced by augmentation and its potential impact on model performance. Typically, as semantic similarity increases, semantic novelty decreases, and vice versa, owing to the trade-off between maintaining meaning and introducing new concepts. A larger cosine distance, as shown in Fig. 7, indicates greater semantic dissimilarity and potentially greater novelty. The different augmentation strategies are also evaluated on how well important vocabulary, relevant phrases, and logical text structure are retained. To this end, TTR and ROUGE scores are measured for the applied DA techniques. TTR, the ratio of unique words to the total number of words in a text, provides indirect information on the variety of the generated text: a low TTR score may indicate a lack of diversity, while a higher score means the data contains more diverse words. The ROUGE-1, ROUGE-2, and ROUGE-3 scores measure several kinds of overlap between the augmented and reference texts: ROUGE-1 examines single words, ROUGE-2 examines word pairs or brief phrases, and ROUGE-3 considers the general organization and flow of the text.
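
As a minimal illustration of the two diversity measures (assuming whitespace tokenization; real ROUGE implementations work on n-gram counts and report precision, recall, and F1, so this is a simplified, count-free recall variant):

```python
def ttr(text):
    """Type-Token Ratio: unique words / total words (higher = more lexical variety)."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0

def rouge_n_recall(reference, candidate, n=2):
    """Simplified ROUGE-n recall: fraction of the reference's n-grams
    that also appear in the candidate (set-based, not count-based)."""
    def ngrams(text):
        w = text.split()
        return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}
    ref = ngrams(reference)
    return len(ref & ngrams(candidate)) / len(ref) if ref else 0.0
```

A high TTR with a low ROUGE-n recall would indicate an augmentation that introduces many new words and new phrasings, while a high ROUGE-n recall signals near-duplication at the n-gram level.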

To assess how well the augmented text retains important details while introducing an element of variation, both measures are used. Figure 8 shows an improvement in diversity after augmentation, based on the TTR and ROUGE metrics, for techniques such as WordAntonym and Backtranslation. However, some DA techniques (WordNet, TF-IDF, and Wordemd_base) decrease the dataset’s diversity, because ROUGE measures diversity over sequences of words. TTR appears more representative of diversity when focusing on individual words, unlike the ROUGE metrics, which capture diversity at the n-gram level.

Fig. 8
figure 8

Diversity Scores for Augmentation Techniques.

A further vital data quality perspective is the ability of a DA technique to preserve the original label of the generated data. We introduce a method that relates the generated data to the original training data as follows. Each DA technique assigns the generated text a label (the predicted label) equal to the label of its original input text. First, we train a baseline classifier (RF) on the labeled original data. Then, we use the trained RF model to classify the newly generated tweets (treated as test data). Finally, we compare the labels inherited from the original text (truth labels) with the baseline classifier’s predictions (predicted labels) to measure the DA technique’s ability to preserve the label. For example, if the RF baseline classifies 95% of the text generated by the WordAntonym technique for the real class as real and 5% as fake, then WordAntonym preserves the label of its generated text with 95% accuracy.
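
The measurement itself can be sketched as follows. The keyword-vote classifier is a toy, hypothetical stand-in for the trained RF baseline, included only so the sketch is self-contained; the preservation-rate computation is the part that mirrors the method above.

```python
def train_keyword_classifier(train_pairs):
    """Toy stand-in for the RF baseline: each word votes for the label
    it most often appeared under in the training data."""
    votes = {}
    for text, label in train_pairs:
        for w in text.split():
            votes.setdefault(w, {}).setdefault(label, 0)
            votes[w][label] += 1

    def predict(text):
        tally = {}
        for w in text.split():
            for label, c in votes.get(w, {}).items():
                tally[label] = tally.get(label, 0) + c
        return max(tally, key=tally.get) if tally else None

    return predict

def label_preservation(predict, generated_pairs):
    """Share of generated tweets that the baseline classifier assigns
    the same label as the tweet they were generated from."""
    kept = sum(predict(text) == label for text, label in generated_pairs)
    return kept / len(generated_pairs)
```

Swapping `train_keyword_classifier` for the actual RF model leaves `label_preservation` unchanged, since it only needs a `predict` callable.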

As shown in Figs. 9 and 10, for the real class label, the tweets generated by WordAntonym and WordNet have the highest precision, at 95% and 93%. In terms of recall, however, Backtranslate surpasses the other DA techniques at 91%. Figure 10 shows the results for the fake class label: tweets generated by WordAntonym and WordNet achieve recall of 96% and 94%, respectively, and F1-scores of 94% and 93%. In terms of precision, Backtranslate and WordAntonym surpass the other DA techniques, at 92% and 92.1%, respectively. We therefore recommend the three techniques WordAntonym, WordNet, and Backtranslate for augmentation when label preservation is the primary data quality measure.

Fig. 9
figure 9

Comparison of DA techniques’ ability to preserve the real class label.

Fig. 10
figure 10

Comparison of DA techniques’ ability to preserve the fake class label.

Experiment D: performance of text classification using ensemble augmentation-based techniques

Machine learning models can be made more robust and general by employing various strategies that produce a range of data variations, thereby improving their ability to generalize to unseen data43. However, this strategy has a drawback: the augmented dataset may become too noisy, making it harder for the models to learn. An effective way to manage this variability is to combine several augmentation strategies, but controlling the diversity of the augmented data is crucial when doing so. It is therefore essential to determine the ideal ratio of noise to diversity in the augmented dataset. Achieving this balance requires careful assessment of model performance on these datasets and rigorous experiments with different combinations of augmentation approaches.

Table 15 Results of RF with Ensemble Augmentation on Augmented Dataset (No. of new Tweets: 1,203).
Table 16 Results of AraBERT with Ensemble Augmentation on Augmented Dataset (No. of new Tweets: 1,203).

The results of Experiment A are compared with those of Experiment D. This comparison determines the effectiveness of combining multiple data augmentation (DA) techniques in enhancing model precision and resilience, and it clarifies how each method affects the augmentation process and the model’s overall performance. After removing duplicates, the ensemble approach combines the tweets generated by the constituent techniques such that the total number of combined tweets equals 1,203, matching the number added to the original training data in Experiment A. We compare the performance of models trained on datasets created by the various combinations of augmentation techniques, as shown in Tables 15 and 16, including Antonym-WordNet, GPT-Antonym, and GPT-Antonym-WordNet.

Both the RF and AraBERT models achieve their highest accuracy with the GPT-Antonym technique. The Antonym-WordNet strategy greatly enhances performance but remains less effective than GPT-Antonym in both models. Generally, AraBERT outperforms RF, especially for the WordAntonym-WordNet and GPT-Antonym combinations: AraBERT achieves a maximum accuracy of 91%, while RF achieves 90.2%. Comparing the best accuracies in Tables 6 and 7 (Experiment A) with those in Tables 15 and 16 (Experiment D), RF’s accuracy decreases by 2.25% (92.45% vs. 90.2%), while AraBERT’s decreases by 4.81% (95.81% vs. 91.0%).

We can conclude that, in our experiments, ensemble augmentation decreases model performance, even though the increased diversity it brings could in principle improve generalization. Combining different augmentation techniques can produce excessive noise, leading to overfitting or misclassification. Careful selection and tuning of augmentation strategies are therefore crucial to maximize model performance without compromising data quality.

Comparison between our proposed approach and the previous approaches

The results of our study are presented in Table 17 alongside those of earlier methods19,69 that used the same dataset. When combined with our suggested augmentation techniques, the performance of RF and AraBERT improves significantly over the earlier methods. Alsudias and Rayson69 introduced a dataset of Arabic fake news and applied traditional ML models, such as SVM and LR, for classification, without any text augmentation before the classification task. Mohamed et al.19 applied only the back-translation DA technique and did not examine the effect of different DA techniques; unlike our work, they also did not consider label preservation, semantic similarity, or novelty. Furthermore, the related work section highlights the distinction between our approach and these state-of-the-art methods, as shown in Table 2.

Table 17 Comparison of the results of our approach and previous works. RF-Aug/AraBERT-Aug refers to using the Random Forest/AraBERT classifier with data augmentation.

Implication for real-world application

Large language models for Arabic text classification can be developed and implemented using insightful results from the proposed ensemble-based Arabic text augmentation method. The following are some ways that the insights from the results can influence and shape LLM usage in this area:

  (a)

    Sectors that rely on Arabic healthcare text classification, whether for sentiment analysis, fake news detection, or topic classification, should prioritize Arabic-specific LLMs like AraBERT. These models account for the morphological complexity, syntax, and semantics of the Arabic language and should therefore produce more accurate classification results.

  (b)

    In sentiment analysis, antonyms are useful for distinguishing between positive and negative sentiments. Accordingly, companies engaged in Arabic NLP should allocate resources to augmenting data with antonym-substitution techniques. Integrating these techniques with LLMs can improve the systems’ ability to handle subtle linguistic variations, resulting in better classification results.

  (c)

    Although GPT-Antonym-WordNet has lower accuracy (89.5% compared to AraBERT), it processes the largest number of tweets (3,536). Accordingly, the developed model can be applied to situations that require scalability and the ability to handle large volumes of tweets, such as real-time monitoring of major social media sites and analyzing consumer feedback. Combining GPT-based models with innovations like antonyms and WordNet can enable industries to strike a balance between speed and accuracy when processing large volumes of Arabic social media data.

  (d)

    A stand-alone GPT’s performance is significantly worse than that obtained after adding WordNet and antonyms, at 85.5% and 86.6%, respectively. When developing LLMs for Arabic tweet categorization, businesses should optimize the models by adding domain-specific features, such as incorporating specialized information or utilizing Arabic language resources.

Conclusion & future work

This study examines how data augmentation affects the performance of Arabic healthcare fake news classification using RF and AraBERT classifiers. A multi-metric analysis, including cosine similarity and BERTScore, is conducted to determine the optimal threshold for comparing the original and augmented Arabic tweets. Additionally, the quality of the augmented data generated through our proposed techniques is evaluated based on diversity, novelty, label preservation, and semantic integrity. The results demonstrate that the AraBERT model outperforms the RF model under every augmentation strategy. Specifically, AraBERT-Aug achieves an accuracy of 97.22%, while RF-Aug achieves an accuracy of 93.58%. AraBERT demonstrated superior performance, improving by 2.68% with WordAntonym and by 5.62% with WordNet. BERTScore-based similarity also proves more effective than cosine similarity at capturing acceptable semantic variations. Among the ensemble augmentation techniques, combining WordAntonym and GPT-2 yields the highest accuracy; however, ensemble augmentation performs worse than the individual DA techniques. Key performance metrics, including the F1-score, cosine similarity, BERTScore, and Jaccard similarity, offer valuable insights into how data augmentation improves the model’s ability to identify fake news. Future research could investigate how more sophisticated models, such as GPT-4, enhance Arabic data augmentation methods. Moreover, integrating explainable AI techniques may increase the transparency and dependability of the model, while incorporating meta-learning and semi-supervised learning approaches may enhance the model’s flexibility.