A data centric HitL framework for conducting a systematic error analysis of NLP datasets using explainable AI

El-Sayed, Ahmed; Nasr, Aly; Mohamed, Youssef; Alaaeldin, Ahmed; Ali, Mohab; Salah, Omar; Khalid, Abdullatif; Lazem, Shaimaa

doi:10.1038/s41598-025-13452-y

Article
Open access
Published: 19 August 2025

Ahmed El-Sayed¹,
Aly Nasr¹,
Youssef Mohamed¹,
Ahmed Alaaeldin¹,
Mohab Ali¹,
Omar Salah¹,
Abdullatif Khalid¹ &
…
Shaimaa Lazem²

Scientific Reports volume 15, Article number: 30406 (2025) Cite this article

Abstract

The interest in data-centric AI has been growing recently. As opposed to model-centric AI, data-centric approaches aim at iteratively and systematically improving the data throughout the model life cycle rather than in a single pre-processing step. The merits of such an approach have not been fully explored on NLP datasets. Particular interest lies in how error analysis, a crucial step in data-centric AI, manifests itself in NLP. X-Deep, a Human-in-the-Loop framework designed to debug an NLP dataset using Explainable AI techniques, is proposed to uncover data problems related to a certain task. Our case study addresses emotion detection in Arabic text. Using the framework, a thorough analysis of misclassified instances was conducted for four classifiers (Naive Bayes, Logistic Regression, GRU, and MARBERT), leveraging two Explainable AI techniques, LIME and SHAP. The systematic process has resulted in identifying spurious correlations, bias patterns, and other anomaly patterns in the dataset. Appropriate mitigation strategies are suggested for an informed and improved data augmentation plan for performing emotion detection tasks on this dataset.

Introduction

Conventional AI model development prioritizes architecture design followed by optimization of hyperparameters. Advancements often involve new or enhanced architectures, with data preparation (pre-processing and augmentation) as the first and only data manipulation step. This approach assumes inherent data quality, allowing researchers to focus on tuning and improving the model as the primary variable while treating the data as constant. Model quality is assessed using accuracy and similar metrics. Low accuracy may prompt suggestions for training on additional data or the exploration of more complex models. However, even with improved accuracy, the quality of the model remains questionable if the underlying data is flawed. The results can be further misleading if the data is never scrutinized, for instance when samples are classified correctly for the wrong reasons. In these scenarios, observed accuracy gains may not reflect genuine improvement.

Data collection and annotation are expensive endeavors. The assumption underlying the conventional approach, often relying on large, perfectly labeled datasets, proves unrealistic in practice. Noise and inherent biases, which often go undetected, further complicate matters. While techniques such as data augmentation, semi-supervised learning, transfer learning, and active learning offer valuable cost-reduction strategies by automating or bypassing those expensive endeavors, further exploration of data optimization techniques is warranted. This is where the data-centric approach comes into play¹,²,³. Data-centric AI (DCAI) draws attention to the central role data plays in elevating AI model performance. It is motivated by the increased availability and democratization of optimized AI architectures and models, allowing the research community to shift their focus and bandwidth to data research². The key to DCAI is treating data as a first-class citizen in the discussion on using AI in real-world applications⁴. Further, DCAI treats data as dynamic, where data refining using error analysis, augmentation, and quality assessment becomes an integrated process within the model development life cycle². The major problems plaguing data, such as noisy data, dirty data, poisoned data, missing or incorrect features and labels, bias, and unfairness, are at the center of DCAI scholarship¹. Many of these techniques existed before the term was coined²,⁵, but as a single step rather than embedded within a process for improving both the data and the model together. Zha et al.³ provided a general overview of the data-centric AI process, the need to use it instead of the model-centric approaches, and its steps (Training data development, Inference data development, and Data maintenance) to systematically engineer high-quality data for building AI models.
The discourse on DCAI is still growing and therefore has been abstract in nature with limited specificity on how the approach would work with various kinds of messy real-world datasets. Further, techniques more specific to AI such as Explainable AI (XAI) have not been thoroughly examined in DCAI discussions.

This research fills this gap by proposing a DCAI-inspired framework for error analysis and data refinement in the domain of Arabic NLP⁶, using XAI techniques. The chosen case study is the problem of emotion detection⁷. Quality emotion detection from text is the backbone for enabling a wide class of applications, most notably chatbots that could be empathic in conversation with users⁸. Emotion detection in Arabic poses DCAI-specific challenges and opportunities. Despite the popularity of Arabic on the Internet, the availability of Arabic language datasets is low compared to other popular languages such as English. Arabic NLP developers often start building their models using the available minimum viable data⁹. This research demonstrates that a DCAI approach could help refine and decide on informed data augmentation efforts. Further, Arabic datasets collected from social media lack proper knowledge about the context of the data, e.g., when tweets are collected using hashtags. This, in turn, makes data annotation subject to inconsistencies and unpredictable biases. The emotion detection task is even more challenging due to the inherent subjectivity in interpreting emotions. A DCAI approach would help developers improve data quality by unveiling inconsistencies in data annotation and potential sources of bias.

In particular, we contribute X-Deep, a Human-in-the-Loop (HitL) framework that leverages XAI for refining and debugging NLP datasets. The framework was used to systematically analyze misclassified instances across four classifiers and examine them using two XAI techniques. This process has resulted in the identification of patterns that can improve dataset quality. To the best of our knowledge, neither the identified patterns nor the process used to uncover them has been addressed in the literature. The practical insights from this research contribute to broadening and advancing the DCAI discourse on NLP data.

The rest of the paper is organized as follows. A summary of the related work is presented in Section “Related work”. The X-Deep framework is proposed in Section “Approach”. The results are presented in Section “Experimental results”, and discussed in Section “Discussion”. The paper is concluded and recommendations for future work are outlined in Section “Conclusion and future work”.

Related Work

This section reviews the existing research efforts for developing XAI debugging frameworks and Arabic emotion detection models.

XAI Debugging Frameworks

The use of XAI has exploded in popularity in recent years¹⁰ as a means to understand the inner workings of black-box models and justify their decisions¹¹. The provided explanations have various uses, whether in supporting human decision making, verifying model reasoning, finding potential causes of errors, or simply building trust in models¹². This is particularly important with the growing concern regarding trustworthy AI and responsible AI as AI takes on increasingly critical roles in our lives¹³, such as in medical or safety-critical systems.

While XAI has shown promise in improving models, its application in debugging has often focused on internal model workings rather than data quality¹⁴. Critics argue that current XAI methods excel at explaining simple, non-realistic scenarios¹⁵. They may overlook spurious correlations or lack a systematic approach to identify data-related issues, leading to the exploration of only anticipated data bugs¹⁴. In the field of NLP, involving human users in debugging using XAI has proven to be beneficial. Hartmann et al.¹⁶ showed how training a model with access to human explanations can improve data efficiency and model performance. Jinghui et al.¹⁷ aimed to increase the generalization of models in few-shot learning scenarios using human-identified rationales in the training process. This approach boosted the performance of models on both in-distribution and out-of-distribution datasets and made models more robust. Mosca et al.¹⁸ argued that interpretability and human participation are fundamental to complex NLP models and proposed a framework for real-time explanation-based interaction with NLP models. Through this framework, users provided feedback on the model’s predictions and explanations, and that feedback was then used to fine-tune the model. This approach helped in reducing bias with minimal impact on model performance. Dong-Ho et al.¹⁹ also proposed a framework for explanation-based model debugging using user feedback on task-level and instance-level explanations to guide the model to make predictions for the correct reasons. This approach led to significant improvement in model generalizability. Notably, most of the debugging frameworks explored hate-speech detection, a task where the source of potential bias can be predicted. Our work expands these efforts to emotion detection tasks, where potential sources of bias or spurious correlations are challenging to predict beforehand.

Emotion Detection in Arabic NLP

Arabic’s unique linguistic features present significant complexity, especially in emotion analysis and during pre-processing, due to its rich morphology, dialectal variations, cultural nuances, diacritics, and character variation. Previous research has largely focused on increasing model size in Arabic NLP, neglecting the potential benefits of customized, task-specific pre-processing techniques²⁰. The work on emotion detection from Arabic text is rather scarce, and the available annotated datasets are limited. Semi-supervised self-learning and transfer learning have shown promising results as alternative approaches²¹. Abdelwahab et al.²² analyzed sentiment in Arabic tweets about LASIK eye surgery using LSTMs and LIME²³. Interestingly, skipping common pre-processing steps improved their accuracy to 79.1%, suggesting a trade-off between simplicity and performance in this specific case. Aljameel et al.²⁴ built a sentiment analysis model to assess public awareness of COVID-19 precautions in Saudi Arabia, aiming to inform public health responses. Analyzing Twitter data during quarantine, they found the Naive Bayes model performed best (80% accuracy) and identified the southern region as most aware, while the central region showed lower awareness, aligning with confirmed COVID-19 cases. Abdelwahab et al.²⁵ addressed the challenge of multi-dialect Arabic sentiment analysis using CNN, LSTM, CNN-LSTM, and RNN models on word and character levels. They found that LSTMs perform best at word level (79% accuracy) and CNNs at character level (86%). Larger and more diverse datasets significantly improve performance compared to single-dialect datasets. Annotating large datasets for a comprehensive task such as emotion analysis is challenging²⁶. A DCAI approach could be helpful in refining the annotation process for smaller datasets by unveiling potential sources of data problems before expanding the annotation efforts.

Analyzing the literature shows that existing research heavily employs XAI for model interpretation or fine-tuning. Data-centric approaches, on the other hand, often overlook the potential of XAI. This work seeks to bridge this gap by leveraging XAI within a data-centric framework. The inherent complexities of the Arabic language make it an ideal case study to demonstrate the value of the proposed approach.

Approach

While previous studies have used XAI tools for model debugging or post-hoc interpretation, they often treat explainability as an end rather than a means for improving the model development process. In contrast, this study introduces a novel framework that integrates HitL feedback and XAI insights to inform key stages of model development, including data pre-processing, model selection, and evaluation. Specifically, the proposed framework, X-Deep, expands the use of XAI tools to perform the following:

Identify anomalies and spurious correlations in the data that may lead to biased predictions.
Compare the behavior of different model architectures to uncover strengths and weaknesses with respect to a specific dataset.
Refine training data and guide model selection based on interpretable signals.
Go beyond surface-level accuracy by ensuring that model predictions are justified by meaningful patterns in the data, in other words, they are right for the right reasons.

This comprehensive approach shows how explainability can inform and improve each stage of development, from data curation to model evaluation. We demonstrate the utility of this framework through a case study in Arabic emotion detection, a challenging task where model bias, language complexity, and limited resources make interpretability especially valuable. The ultimate objective is not solely to achieve high accuracy, but rather to improve data collection in order to develop a model that provides explainable and coherent outputs.

X-Deep is a HitL framework that helps NLP developers analyze datasets to detect recurring patterns and ambiguities, which we call anomalies, that may mislead classifiers. Rather than relying solely on standard accuracy metrics, X-Deep uncovers deeper quality issues within the data. Highlighting these anomalies offers insights for pre-processing or additional data collection, avoiding brute-force methods like indiscriminate data collection or uniform pre-processing. The goal is to augment human analysis with XAI to identify anomalies², where the augmentation of human expertise was shown to reveal latent weaknesses and refine critical processes such as data acquisition and pre-processing¹², ²⁷, ²⁸.

As shown in Fig. 1, the framework begins by training models on a dataset and identifying misclassifications, then uses XAI to inspect and interpret these misclassifications to infer underlying data anomalies. The process can also extend to correctly classified samples suspected of flawed reasoning. We define anomaly patterns as recurring phenomena that lead to systematic misclassifications across classifiers, indicating inherent data issues (such as incorrect labels, biases, or spurious correlations) that cannot be resolved by blindly adding more data. XAI tools are deployed to confirm that these anomalies are systemic (not isolated instances) and to validate proposed mitigation strategies.

Four diverse classifiers, Naive Bayes, Logistic Regression, GRU, and MARBERT, along with two model-agnostic XAI methods, LIME²³ and SHAP²⁹, were used in the presented experiments to prevent technique-specific or classifier-specific biases in anomaly detection. An initial list of hypothesized anomalies helped initiate and guide the iterative process of anomaly detection. The list was formed based on an exploratory analysis of the dataset, including label distributions, frequent word-label associations, and manual review of the text.

A key component of X-Deep is the incorporation of human feedback. Human involvement is essential for interpreting the nuances revealed by XAI, identifying subtle patterns, and providing qualitative judgments that automated processes might miss. The framework presents XAI explanations to human participants, who then offer feedback to refine and improve the dataset. In this study, the human participants were seven of the authors of the study, all of whom are native Arabic speakers with a background in computer science and familiarity with emotion detection tasks. This linguistic and technical expertise was important for interpreting the anomalies within the Arabic dataset and how they related to choices made during the data collection or pre-processing. Optimally, for the X-Deep framework to be most effective, the humans participating in the HitL process should possess relevant subject matter expertise combined with proficiency in the language of the dataset being analyzed. This combination allows for a deeper understanding of context, cultural nuances, and domain-specific issues that are essential for accurately identifying and interpreting data anomalies highlighted by XAI tools.

Experiments

The chosen dataset³⁰ is an Arabic tweets dataset that was aggregated from pre-existing datasets and newly scraped tweets and then manually annotated into eight emotion classes: none, anger, joy, sadness, love, sympathy, surprise, and fear. The label “none” was given to tweets that were perceived to contain “no emotion”. Four models were tested: two deep neural networks and two machine learning algorithms. For deep learning classifiers, Gated Recurrent Unit (GRU) and transformer (MARBERT)³¹ models were used. GRU was chosen for being the best-performing model for sentiment analysis in related work³², while the MARBERT model was chosen for being the state-of-the-art model for Arabic NLP tasks at the time of conducting the experiments²¹. For machine learning classifiers, Naive Bayes (NB) and Logistic Regression (LR) were picked. Those two models were chosen for being the worst-case and best-case performers, respectively, in similar NLP tasks³³.

Four main explorations were conducted: exploring the dataset, comparing the two XAI techniques, generating anomaly patterns, and comparing the four classification models. The first exploration was conducted to determine initial pre-processing; the remaining ones aim to answer the following research questions.

RQ1: How do the performances of LIME and SHAP compare in investigating the nuances of the dataset?
RQ2: How effective is X-Deep with respect to devising a data pre-processing and augmentation plan for refining the current dataset?
RQ3: What insights can XAI provide into the performance of the four classifiers GRU, MARBERT, LR, and NB in the emotion detection task?

The details of each experiment are presented in the following subsections.

Exploring the Dataset

An exploratory data analysis phase was conducted to determine the pre-processing techniques suitable to the dataset and the purpose of the experiments to follow. This initial step involved a thorough examination of the dataset characteristics. Specifically, we analyzed the distribution of labels and performed word frequency analyses across the entire corpus. Furthermore, we investigated the prevalence of emojis, both globally and within individual label categories, to identify potential patterns and inform subsequent pre-processing decisions. Observations during this phase included instances of non-standard language usage, such as word repetition (e.g., “no no no”) and character elongation (e.g., “nooooooo”), as well as the presence of non-alphabetic characters. Notably, preliminary analysis highlighted potential challenges, such as the ambiguous labeling criteria associated with the “None” category and the skewed distribution of the most frequent word towards a single label, which could introduce bias in model training. Based on these initial findings, we formulated and evaluated several pre-processing strategies, including various stemming approaches and methods for handling emojis.
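The exploratory checks described above can be illustrated with a minimal Python sketch. It is not the authors' code: the toy English samples, the function names, and the two regular expressions are assumptions chosen for illustration, and emoji analysis is left out.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (text, label) pairs standing in for annotated tweets.
SAMPLES = [
    ("no no no this is bad", "anger"),
    ("nooooooo", "sadness"),
    ("what a lovely day", "joy"),
    ("lovely lovely people", "love"),
    ("meeting at five", "none"),
]

ELONGATION = re.compile(r"(\w)\1{2,}")           # same character 3+ times in a row
REPETITION = re.compile(r"\b(\w+)(?:\s+\1\b)+")  # same word repeated back to back


def label_distribution(samples):
    """Count how many samples carry each label."""
    return Counter(label for _, label in samples)


def word_label_counts(samples):
    """Map each word to a Counter of the labels it appears under
    (each word is counted once per sample)."""
    table = {}
    for text, label in samples:
        for word in set(text.split()):
            table.setdefault(word, Counter())[label] += 1
    return table


def has_elongation(text):
    """True if some character is repeated three or more times in a row."""
    return ELONGATION.search(text) is not None


def has_repetition(text):
    """True if some word is immediately followed by itself."""
    return REPETITION.search(text) is not None
```

A word whose label counter is dominated by a single label is a candidate for the skewed word-label association mentioned above; whether it actually biases a model still has to be checked with the XAI tools.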

Further, transformer models, recognized as the state of the art in NLP, are typically pre-trained on raw text data. To isolate the impact of pre-processing on MARBERT, a model pre-trained on raw text, we compared two configurations: one using common Arabic NLP pre-processing steps (e.g., emoji replacement, stemming, stop-word removal, and text normalization) and another retaining the original, unprocessed text. The evaluation, conducted according to the pipeline in Fig. 2, aimed to determine whether pre-processing enhances the performance of models pre-trained on raw linguistic data, while also assessing its effect on the interpretability of XAI outcomes.

For each pre-processing configuration (including raw data), we independently fine-tuned MARBERT. Classification metrics were then compared, and mismatched predictions were analyzed using XAI. This dual approach evaluated not only predictive accuracy but also the fidelity of explanations, particularly for correct predictions driven by potential spurious correlations.

Comparison of the XAI Techniques

The goal of the experiment is to compare LIME and SHAP for inspecting the nuances of NLP models. The pipeline of the experiment follows Fig. 3. At the outset, we defined preliminary aspects (namely consistency and robustness) that we wanted to evaluate, but remained uncertain about what other behaviors to probe; therefore, we first observed how the tools behaved on a randomly selected subset of the data. The resulting explanations were then scrutinized to identify problematic or particularly interesting anomalies and behavior patterns that required deeper analysis. Subsequently, a targeted selection focused on samples with specific pre-processing or structural properties, such as the presence of repeated words or variations in text length. This targeted approach aimed to determine whether observed characteristics represented isolated occurrences or consistent patterns across similar samples (either by extracting items with matching structures or attributes directly from the dataset, or by manually generating parallel examples). The two XAI techniques were then compared across a number of different aspects.

Consistency checks whether the XAI technique maintains similar feature weights upon repeated application to the same sample. This is crucial as the framework relies on iterative analysis, and significant variations in weights between runs would compromise the validity of the process. This was tested by running the same sample through both tools multiple times and observing any changes in the weights assigned to words.
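One way to quantify such a consistency check is to record the per-word spread of weights over repeated runs. The sketch below assumes a hypothetical `explain` callable that maps a text to a word-to-weight dictionary; `make_noisy_explainer` is an invented stand-in for a sampling-based explainer and does not call LIME or SHAP.

```python
import random
import statistics


def weight_spread(explain, text, runs=5):
    """Run `explain` several times on the same text and return, per token,
    the population standard deviation of the weights it was assigned."""
    collected = {}
    for _ in range(runs):
        for token, weight in explain(text).items():
            collected.setdefault(token, []).append(weight)
    return {token: statistics.pstdev(ws) for token, ws in collected.items()}


def make_noisy_explainer(base_weights, noise, seed=0):
    """Hypothetical stand-in for a sampling-based explainer: returns fixed
    weights plus uniform noise, so repeated runs disagree when noise > 0."""
    rng = random.Random(seed)

    def explain(text):
        return {tok: base_weights.get(tok, 0.0) + rng.uniform(-noise, noise)
                for tok in text.split()}

    return explain
```

A spread of zero for every word means the runs agree exactly; what counts as an acceptable non-zero spread is a judgment call that the source does not specify.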

Robustness assesses the sensitivity of the feature weights to minor perturbations in the input. Testing was done by manually removing words, focusing especially on those with low weights, and observing the resulting effects on the weights of other words.
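The removal test can be sketched as follows. The `explain` callable and the `toy_explain` function are hypothetical placeholders for illustration (the toy weights depend on text length only so that removing a word visibly shifts the others); they are not part of LIME, SHAP, or the study.

```python
def removal_shift(explain, text, drop_token):
    """Remove every occurrence of `drop_token` and report how much the weight
    of each remaining token changes between the two explanations."""
    before = explain(text)
    reduced = " ".join(t for t in text.split() if t != drop_token)
    after = explain(reduced)
    return {tok: after[tok] - before[tok] for tok in after if tok in before}


def lowest_weight_token(explain, text):
    """Pick the token whose absolute weight is smallest (first one on ties)."""
    weights = explain(text)
    return min(weights, key=lambda tok: abs(weights[tok]))


def toy_explain(text):
    """Hypothetical explainer: lexicon weight divided by the number of tokens."""
    lexicon = {"happy": 0.9, "sad": -0.9}
    tokens = text.split()
    return {t: lexicon.get(t, 0.1) / len(tokens) for t in tokens}
```

Small shifts after dropping a low-weight word suggest a robust explanation; large shifts flag samples worth a closer manual look.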

Repeated words are handled differently by LIME and SHAP. By default, LIME employs a Bag-of-Words (BoW) representation, grouping all instances of a repeated word into a single feature and assigning a cumulative weight to the collection as a whole. In contrast, SHAP assigns a distinct weight to each individual occurrence of a repeated term, reflecting its specific position and contextual interactions. It is therefore important to analyze how this difference affects the resulting explanations.
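To put the two views side by side, per-occurrence weights can be collapsed into one weight per distinct word. The sketch below assumes that summing the occurrences is a reasonable way to approximate the BoW view; the function names and the input format (parallel lists of tokens and weights) are illustrative choices, not either library's output format.

```python
from collections import defaultdict


def per_occurrence_to_bow(tokens, weights):
    """Collapse per-occurrence weights (one per token position) into a single
    summed weight per distinct word, mimicking a bag-of-words view."""
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must align position by position")
    grouped = defaultdict(float)
    for token, weight in zip(tokens, weights):
        grouped[token] += weight
    return dict(grouped)


def occurrence_breakdown(tokens, weights, word):
    """List (position, weight) for every occurrence of `word`."""
    return [(i, w) for i, (t, w) in enumerate(zip(tokens, weights)) if t == word]
```

The breakdown shows whether one occurrence carries most of the weight, which is information the grouped view hides.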

Time complexity analysis was conducted to provide a heuristic comparison of the computational runtime of the two XAI techniques. This analysis involved examining how each tool operates in both theory and practice. Despite the availability of adequate computing resources, the iterative nature of the analysis highlights the necessity of considering computational efficiency. If the performance of the XAI techniques is comparable, a faster algorithm would be preferred, especially when the model being examined is of higher complexity. These comparisons were conducted for the four classifiers.
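A heuristic runtime comparison of this kind can be as simple as averaging wall-clock time per explanation. The sketch below assumes a hypothetical `explain` callable that wraps whichever explainer is being timed; the number of repeats is an arbitrary choice.

```python
import statistics
import time


def mean_runtime(explain, texts, repeats=3):
    """Average wall-clock seconds per explanation over `repeats` passes."""
    durations = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            explain(text)
            durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```

Calling `mean_runtime` once per explainer on the same texts and the same classifier gives two comparable numbers. Such measurements depend on hardware and on the explainers' sampling settings, so they indicate relative cost rather than a formal complexity bound.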

Generating Anomalies

The goal of this experiment is to systematically identify and document anomaly patterns within the dataset that contribute to classification errors, as shown in Table 1. A suggested list of anomalies was proposed based on initial explorations of the dataset in the previous experiments and prior experience with similar datasets. The pipeline of the experiment, illustrated in Fig. 4, is a multi-stage process designed to set and refine this list and analyze the characteristics and potential impact of the identified anomalies. Note that this experiment focuses exclusively on the identification and validation of these patterns within a dataset, rather than implementing or evaluating specific corrections for the identified anomalies. The framework proceeds through the following stages.

Initial Identification and Inspection: Utilizing confusion matrices generated from the models’ predictions, we selected samples that were incorrectly classified across all target labels. We placed particular focus on labels with a high frequency of misclassifications, labels that were consistently confused with specific other labels, and samples misclassified by more than one model. This broad selection strategy ensured a comprehensive pool of instances for subsequent analysis. The sample-selection stage was also guided by the list of potential anomalies in Table 1. All selected samples were then subjected to in-depth inspection using the XAI tools across all classifiers. Our goal was to understand the specific features or patterns within each instance that contributed to the classification error, thereby characterizing the anomalies present in the data.

XAI-Guided Anomaly Hypothesis Testing: Each selected sample was analyzed using XAI tools, namely LIME and SHAP. By examining feature attributions across similar samples and classifiers, we pinpointed the specific linguistic or structural elements that caused each misclassification. Whenever a recurring pattern emerged, such as problems related to a specific word or dialect, it was designated as an anomaly. Hypotheses that lacked sufficient empirical support were set aside for subsequent review.

Cross-Classifier Sensitivity Analysis: For each newly defined anomaly, we assessed whether other classifiers in our suite exhibited similar error patterns. This step revealed model-specific blind spots and determined whether the anomaly was inherent to the data or specific to a particular model.

Iterative Refinement & Validation: To refine each anomaly’s definition and confirm its impact, we conducted two additional HitL passes. In the second iteration, borderline cases were reviewed by either minimally editing existing examples or crafting synthetic ones that embodied the anomaly; we then observed whether removing the anomalous feature would flip the classification outcome. A third pass served purely to validate consistency: every anomaly had to reliably reproduce the failure mode across multiple samples and classifiers. In total, we performed three annotation cycles (identification, refinement, and validation) to converge on a robust, actionable anomaly catalog.
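The sample-selection step of the first stage can be sketched in plain Python. This is an illustrative assumption about how the selection could be coded, not the authors' implementation; the function names, the dictionary of per-model predictions, and the `min_models` threshold are all hypothetical.

```python
from collections import Counter


def confusion_pairs(gold, predicted):
    """Count (gold, predicted) label pairs for misclassified samples only."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


def shared_errors(gold, predictions_by_model, min_models=2):
    """Return indices of samples misclassified by at least `min_models` models."""
    selected = []
    for i, g in enumerate(gold):
        wrong = [m for m, preds in predictions_by_model.items() if preds[i] != g]
        if len(wrong) >= min_models:
            selected.append(i)
    return selected
```

Calling `confusion_pairs(...).most_common()` ranks the label pairs that are confused most often, and `shared_errors` narrows the pool to samples where several models fail together, which is where a data-level cause is more plausible than a model-specific one.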

Table 1 The list of Anomalies.

Comparison of the Models

The goal of this experiment is to gain a better understanding of the particularities of different classification models with the Arabic emotion detection task in general by examining how the XAI techniques justify their predictions. For instance, even when all classifiers predict the same label for a specific sample, the explanations can differ significantly, let alone in cases where models predict different labels and the corresponding explanations diverge. The pipeline of the experiment follows Fig. 5. Samples where all models achieved correct predictions, where all models made incorrect predictions, and where only a single model predicted correctly or incorrectly were selected for explanation. The rationale behind this selection was to identify the unique behavioral characteristics of each model under identical circumstances. These samples were then analyzed using both LIME and SHAP across all classifiers to identify potential trends in model behavior, such as whether two models exhibit similar explanatory patterns or if their behaviors are entirely distinct. Throughout the iterative analysis, observed trends were either confirmed, refuted, or temporarily noted for further investigation based on the consistency of the findings across one or multiple classifiers.
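The sample selection described here amounts to bucketing samples by how many models classify them correctly. The following sketch assumes predictions are available as one list per model; the bucket names and the extra "other" bucket are illustrative additions.

```python
def agreement_groups(gold, predictions_by_model):
    """Bucket sample indices by how the models' correctness lines up.
    With fewer than three models, 'single_correct' takes precedence."""
    groups = {"all_correct": [], "all_wrong": [],
              "single_correct": [], "single_wrong": [], "other": []}
    n_models = len(predictions_by_model)
    for i, g in enumerate(gold):
        n_right = sum(preds[i] == g for preds in predictions_by_model.values())
        if n_right == n_models:
            groups["all_correct"].append(i)
        elif n_right == 0:
            groups["all_wrong"].append(i)
        elif n_right == 1:
            groups["single_correct"].append(i)
        elif n_right == n_models - 1:
            groups["single_wrong"].append(i)
        else:
            groups["other"].append(i)
    return groups
```

Each bucket except "other" corresponds to one of the selection criteria above, so the explanations produced for its samples can be compared model by model.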

Implementation Details

All of the experiments were run using an NVIDIA P100 GPU to accelerate training. The dataset was split into 70% for training, 15% for validation, and 15% for testing, and all models were trained and evaluated on the same splits. Furthermore, a fixed checkpoint was used for testing across all experiments to ensure consistency. Hyperparameter optimization was performed for each model to maximize the validation F1-score.
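A fixed 70/15/15 split that every model can share might be produced as in the sketch below. The seed value, the plain random shuffle, and the absence of stratification are assumptions; the source does not state how the split was drawn.

```python
import random


def split_indices(n_samples, train=0.70, val=0.15, seed=42):
    """Shuffle indices once with a fixed seed and cut them into
    train / validation / test partitions (test takes the remainder)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])
```

Storing the three index lists and reusing them for every classifier keeps the models comparable on identical samples.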

Naive Bayes

The Gaussian Naive Bayes implementation was used with variance smoothing set to \(1\times 10^{-6}\). This configuration was evaluated both alone and in combination with Linear Discriminant Analysis (LDA) using 44 components. The average inference time per sample was 0.0009 s for both configurations.

Logistic Regression

The sklearn LogisticRegression model was used with the “lbfgs” solver and the One-vs-Rest strategy. The model was trained on 86-dimensional embeddings with L2 regularization (inverse regularization strength \(C=1\)). The model took around 41 s to converge during training and the average inference time per sample was 0.0005 s.

GRU

The GRU model comprised three layers with 128, 64, and 32 units, respectively, interleaved with 20% dropout. The token sequences were embedded into 86-dimensional vectors and processed using gated recurrences. Optimization was performed using Adam (learning rate \(6.5\times 10^{-3}\)). Training was capped at 100 epochs with early stopping (patience of 10 epochs and restoration of best weights) and mini-batches of size 64. The model converged on average after 11 epochs, requiring approximately 30 s. The average inference time per sample was 0.026 s.
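The early-stopping rule can be illustrated independently of any deep learning library. The sketch below assumes a monitored validation score where higher is better; the source does not state which quantity was monitored, and the function name is hypothetical.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a sequence of
    validation scores where higher is better. Training stops once `patience`
    epochs pass without improvement; the weights kept are those of best_epoch."""
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores), best_epoch
```

With a patience of 10, a run whose best score occurs at epoch 3 stops at epoch 13 and restores the epoch-3 weights.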

MARBERT

The pre-trained MARBERTv2 model was used as a base model and fine-tuned for two epochs on the dataset. A batch size of 16 and a learning rate of \(3.45\times 10^{-5}\) with a weight decay of 0.0156 were used. The model training took around 189.5 s for two epochs. The average inference time per sample was 0.140 s.