Introduction

In many existing studies and applications, it is often implicitly assumed that the inputs provided to large language models (LLMs) and natural language processing (NLP) pipelines are well-structured, unambiguous, and free of noise. This assumption may hold in controlled experimental environments; however, it does not accurately reflect the nature of real-world data, which is often noisy, incomplete, or distorted due to user errors, adversarial manipulation, or data acquisition issues1. As a result, LLMs and NLP systems that are designed and evaluated under idealized conditions may exhibit degraded performance or unpredictable behavior when deployed in more complex, dynamic, and noisy operational settings. This gap between training conditions and deployment environments highlights a critical challenge: ensuring the robustness of LLMs against various types of input perturbations. Investigating the resilience of NLP models to such perturbations is essential for building trustworthy and reliable AI systems. Without such consideration, these models may be prone to failures, biases, or misinterpretations when confronted with unexpected or corrupted inputs, potentially leading to harmful or unintended outcomes2.

In real-world applications, textual noise can arise from a wide range of sources, affecting the reliability and performance of large language models (LLMs). A significant portion of this noise originates from human factors, including typographical errors, misspellings, grammatical inconsistencies, and informal language usage. Such imperfections are common in user-generated content, such as social media posts, emails, or text messages. In addition to human-related noise, automated systems also contribute to textual degradation. For instance, machine-based pipelines like optical character recognition (OCR) and automatic speech recognition (ASR) are prone to introducing transcription errors, misidentified words, or incomplete text, especially under suboptimal conditions such as low image or audio quality3.

While modern LLMs are pre-trained on vast corpora of web-based text, the precise amount and nature of noise within these pre-training datasets remain unclear. More importantly, these models are often further refined through post-training or fine-tuning on highly curated and cleaned datasets, particularly for downstream tasks such as question answering or summarization. This discrepancy between noisy pre-training data and sanitized post-training data may limit the models’ robustness, leading to reduced generalizability and performance degradation when they are deployed in real-life scenarios involving noisy inputs. Furthermore, semantic variations in input–such as the substitution of synonyms or paraphrased expressions–can significantly alter the model’s interpretation and output, particularly if the model lacks sufficient exposure to such lexical diversity during training3.

Traditional NLP pipelines typically incorporate explicit data preprocessing steps aimed at mitigating noise, including techniques such as language error correction (LEC) and spelling normalization3. Although some advances, such as subword-level and character-level embeddings, have shown improved resilience to certain types of textual noise, recent studies have demonstrated that LLMs remain vulnerable to word-level perturbations. These perturbations can cause notable shifts in the model’s output, highlighting a persistent robustness challenge even in state-of-the-art models4,5.

Textual perturbations–such as replacing a word with its synonym, introducing typographical errors, or making small structural edits–are common in user-generated inputs provided to large language models (LLMs)5. These modifications, while often unintentional, can have a substantial effect on the performance and reliability of LLMs. A key reason for this sensitivity is that LLMs are typically trained on large-scale, clean, and well-curated corpora that do not fully reflect the imperfections and inconsistencies found in real-world text. As a result, even minor deviations–such as a single character substitution, misspelling, or word replacement–can cause unexpected outputs or degraded model performance. Perturbations vary in complexity, ranging from single-character noise to more nuanced lexical changes, such as paraphrasing or synonym substitution. While some models have demonstrated partial robustness to specific types of noise, many LLMs still exhibit significant performance drops when the input deviates from their expected distribution, particularly in tasks requiring grammatical integrity, semantic precision, or context-sensitive reasoning. These limitations underscore the need for enhanced robustness mechanisms that allow LLMs to maintain performance even when faced with textual variability.

LLMs have achieved impressive capabilities across a broad range of natural language tasks, including text summarization, question answering, content generation, and reasoning6,7,8,9,10. These advancements are largely attributed to the auto-regressive training of LLMs on massive corpora that span diverse topics and styles. Despite the inclusion of both clean and perturbed data in some training pipelines to enhance generalization, empirical evaluations have shown that LLMs remain vulnerable to input noise, particularly in downstream tasks such as question answering (Q&A) and classification. While several studies have explored the robustness of LLMs in various scenarios, to the best of our knowledge, no prior work has systematically assessed the impact of text perturbations on LLMs in the context of the text generation task, where the input variations may lead to changes in both understanding and generated responses.

In this study, we present a comprehensive evaluation of LLM robustness against two specific forms of perturbations: keyboard-based typographical errors and word replacement. These perturbations were synthetically introduced into two benchmark datasets to generate perturbed counterparts of the original clean datasets. We selected six LLMs from both open-source and proprietary sources and evaluated their performance across clean and perturbed versions of the datasets. Our aim is to quantify the degradation in model accuracy due to text perturbations and to analyze the degree to which different models are resilient to such variations. The remainder of the paper is organized as follows: the next section reviews relevant background and related work; this is followed by a detailed explanation of our evaluation methodology; finally, we present our experimental results and offer a critical discussion of the findings.

Background

In this section, we highlight existing approaches for assessing the robustness of LLMs to perturbations in prompts. Wang et al. assessed LLM robustness to perturbation using a subset of the Natural Questions dataset11 (1k questions)5. The perturbations used by the authors comprise three levels of word perturbation: misspelling (level 1), swapping words (level 2), and replacing words with their synonyms (level 3). The authors used the Beaver-7B Reward Model12 and its associated cost model to evaluate the similarity between the LLM's responses to a clean prompt and to a perturbed prompt. They concluded that the evaluated LLMs were vulnerable to word-level perturbation.

Another work by Rauba et al. proposed a framework for assessing the robustness of LLMs to perturbations called Distribution-Based Perturbation Analysis (DBPA)13. This framework measures robustness using the distribution of the LLM's outputs. The approach was assessed under perturbations of the input prompt as well as perturbations of the LLM itself through fine-tuning (system perturbation). The authors observed variation in the responses when applying prompt perturbations.

Alvarado et al. investigated the resilience of large language models (LLMs) to various text perturbations, such as typos, synonym substitutions, and structural changes, across multiple natural language processing (NLP) tasks14. The authors analyzed the performance of four BERT-based models (DistilBERT, ELECTRA, XLNet, and Funnel Transformer) on tasks including grammatical coherence, sentiment analysis, and hate speech detection. They categorized perturbations into character-level, word-level, and miscellaneous changes and evaluated model robustness using Cohen's kappa coefficient. The findings revealed that the models exhibit varying degrees of vulnerability, with character-level perturbations having the most disruptive impact. Among the models tested, XLNet demonstrated the highest resilience, particularly in tasks requiring complex linguistic reasoning. The study highlights the necessity for models to adapt to real-world text irregularities to maintain reliable performance.

An enhancement of LLM robustness to perturbed instructions was proposed by Agrawal et al., who experimented with and compared different approaches such as iterative self-denoising, perplexity smoothing, instruction sampling, and representation alignment. The authors found that self-denoising enhances the robustness of the LLM against prompt perturbations15. They experimented with two LLMs, LLaMA-3 and Flan-T5, using three classification datasets from the GLUE benchmark16: CoLA17, QNLI18, and SST-219. Another approach assesses the robustness of LLMs against text prompt perturbations using cosine similarity as a measure of the semantic agreement between the LLM's responses to noisy and clean prompts3. This approach was evaluated on three datasets using three LLMs: GPT-4, LLaMA-3 7B, and BERT.

The DBPA framework of Rauba et al.13 additionally uses Monte Carlo sampling to construct a low-dimensional semantic similarity space for quantifying the impact of perturbations on LLMs. The NLPerturbed framework was proposed by Chen et al.20 for evaluating 18 different categories of perturbation applied to a given set of prompts. That study focuses on the robustness of code generation against prompt perturbations, and the authors showed that perturbed prompts decrease code generation performance.

Another approach that exposes the fragility of LLM safeguards against perturbation was proposed in21. It shows that appending a space or a string of characters to the end of the original prompt can override the trained behavior of refusing to reply to unsafe prompts, which underscores the importance of addressing the robustness of LLMs against perturbation. Chaudhary et al.22 evaluated the robustness of Google Gemini 1 against perturbations and compared its performance with human performance in scoring text datasets. This work used two datasets, SummEval23 and USR24, and showed that Google Gemini 1 falls short when perturbations are injected into the prompts.

Fig. 1 The pipeline of our approach for assessing the robustness of LLMs against perturbations. Dotted arrows denote perturbed text prompts and the resulting responses, whereas solid arrows denote clean text prompts and the resulting LLM responses.

Experiment

Dataset

In this experiment, we utilized two publicly accessible datasets to evaluate sentiment analysis models. The first dataset is the Amazon Reviews dataset, which can be obtained from the HuggingFace platform. This dataset comprises a collection of user-generated textual reviews for a wide range of Amazon products. It contains a total of 4,920 individual entries, each representing a single review written by a customer. These reviews encapsulate subjective feedback, opinions, and sentiments regarding the quality and performance of various products, making the dataset suitable for sentiment classification tasks.

The second dataset employed in our study is a compact version of the IMDb movie reviews dataset, also available through HuggingFace. This dataset includes 3,000 movie reviews authored by users, each comprising a single entry of free-text feedback about a particular film. These reviews often express nuanced emotional reactions and personal evaluations, offering rich contextual information for training and evaluating sentiment analysis models. The linguistic diversity and varying intensity of sentiment present in this dataset make it particularly valuable for developing models capable of understanding and interpreting natural language sentiment.

Large language models

In this study, we used six large language models (LLMs): Gemma3 20B, Gemma3 4B, GPT-4o 12B, Llama3.2 4B, Llama3.3 70B, and Phi 3B. All models were run locally through the Ollama software package, except GPT-4o and Llama 3.3 70B, for which we used the OpenAI API and the Together.ai API, respectively. This selection spans a range of architectures and parameter sizes, enabling a comprehensive analysis of how model scale and design affect resilience to input noise. Each LLM was prompted with a single text at a time (as detailed in the experiment below), with the temperature parameter set to zero to ensure that the same input prompt yields the same response and to eliminate stochastic sampling of output tokens.
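For illustration, the sketch below shows how a hosted model and a locally served model might be queried deterministically with the temperature fixed at zero. It assumes the openai and ollama Python packages; the model identifiers and prompt wiring are illustrative placeholders rather than our exact configuration.

```python
# Illustrative sketch (not our exact setup): deterministic querying of one hosted
# and one locally served model. Assumes the `openai` and `ollama` Python packages.
from openai import OpenAI
import ollama

PROMPT = "Write two sentences at maximum describing the sentiment in the following text: {review}"

def query_gpt4o(review: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(review=review)}],
        temperature=0,  # suppress stochastic sampling of output tokens
    )
    return resp.choices[0].message.content

def query_local(review: str, model: str = "llama3.2") -> str:
    resp = ollama.chat(
        model=model,  # any model pulled into the local Ollama server
        messages=[{"role": "user", "content": PROMPT.format(review=review)}],
        options={"temperature": 0},
    )
    return resp["message"]["content"]
```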

Method

To simulate realistic input variations, we introduced two types of perturbations: keyboard typos and word replacement. These were applied using the Textflint library, following the methodology described in Romero et al. (2024)14 and Wang et al. (2021)25. The keyboard typo perturbation randomly replaces characters in words with those on adjacent keys, often producing malformed or misspelled tokens that can obscure or distort meaning. In contrast, word replacement largely preserves semantic intent while varying surface form by replacing words with lexical equivalents, reflecting the variability commonly seen in user-generated content. The Textflint package uses the WordNet database to find synonyms of words.
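Since the exact Textflint calls are not reproduced here, the following simplified sketch illustrates the two perturbation types themselves: keyboard-adjacency typos and WordNet-based word replacement via NLTK. The adjacency map, perturbation rate, and replacement policy are assumptions made for this example and only approximate what the library does.

```python
# Simplified illustration of the two perturbation types (not Textflint's API).
# The adjacency map, perturbation rate, and replacement policy are assumptions.
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

ADJACENT = {  # partial QWERTY adjacency map, for illustration only
    "a": "qwsz", "e": "wrsd", "i": "ujko", "o": "iklp",
    "s": "awedxz", "n": "bhjm", "t": "rfgy", "u": "yhji",
}

def keyboard_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Replace a fraction of characters with a neighboring key."""
    rng = random.Random(seed)
    chars = list(text)
    for idx, ch in enumerate(chars):
        if ch.lower() in ADJACENT and rng.random() < rate:
            chars[idx] = rng.choice(ADJACENT[ch.lower()])
    return "".join(chars)

def word_replacement(text: str, seed: int = 0) -> str:
    """Replace each word with a WordNet lemma from its synsets, when one exists."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        lemmas = {l.name().replace("_", " ")
                  for syn in wordnet.synsets(word) for l in syn.lemmas()}
        lemmas.discard(word)
        out.append(rng.choice(sorted(lemmas)) if lemmas else word)
    return " ".join(out)

print(keyboard_typos("no issues with this product"))
print(word_replacement("loves this movie"))
```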

All models were evaluated using a consistent sentiment analysis prompt template: “Write two sentences at maximum describing the sentiment in the following text: [Review].” Here, [Review] was replaced by each review sample in our datasets. Table 1 shows representative examples of both original and perturbed reviews. This setup allows us to measure how well the models generalize across both semantics-preserving and semantics-altering changes, revealing insights into their practical robustness under noisy or variable input conditions.

To evaluate the resilience of an LLM denoted by f, we iterated over each original sample \(x_o\) in the datasets \(D_1\) and \(D_2\) and applied the perturbation step (i.e., keyboard typos or word replacement) to obtain the perturbed text \(x_p\). The response to the original sample is then obtained as \(r_o = f(x_o)\), where \(r_o\) denotes the response when the LLM f is prompted with the original text sample \(x_o\). Similarly, for the perturbed text \(x_p\), we prompt the LLM f such that \(r_p = f(x_p)\), where \(r_p\) is the response of the LLM when prompted with the perturbed text \(x_p\). In other words, model outputs were compared between original and perturbed inputs to quantify resilience to both semantics-preserving and semantics-altering variations. This experimental design enables a controlled evaluation of large language models’ ability to generalize under realistic, noisy input conditions. In Fig. 1, we show the pipeline of our approach for evaluating the resilience of text generation LLMs against text perturbations.
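A minimal sketch of this evaluation loop is given below. The query_llm and perturb arguments stand in for any of the six models and either perturbation type; they are placeholders for illustration, not our exact implementation.

```python
# Minimal sketch of the evaluation loop: collect (r_o, r_p) response pairs.
# `query_llm` and `perturb` are placeholders (e.g., the helpers sketched above).
PROMPT = "Write two sentences at maximum describing the sentiment in the following text: {review}"

def evaluate_model(query_llm, perturb, reviews):
    pairs = []
    for x_o in reviews:
        x_p = perturb(x_o)                          # perturbed counterpart of x_o
        r_o = query_llm(PROMPT.format(review=x_o))  # r_o = f(x_o)
        r_p = query_llm(PROMPT.format(review=x_p))  # r_p = f(x_p)
        pairs.append((r_o, r_p))
    return pairs

# Example wiring: pairs = evaluate_model(query_gpt4o, keyboard_typos, amazon_reviews)
```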

Table 1 Examples of review texts in the datasets and the perturbed versions produced by the keyboard and word replacement approaches using the Textflint package.

Evaluation

To systematically assess the robustness of the six large language models (LLMs) when exposed to perturbed text prompts, we employed a comprehensive evaluation framework utilizing three widely adopted performance metrics: BLEU score, ROUGE-L F1 score, and Jaccard Index. These metrics collectively capture different dimensions of output quality, including n-gram overlap, recall-oriented summarization accuracy, and set-based similarity, respectively. This multi-metric approach enables a more nuanced understanding of each model’s sensitivity to textual perturbations, facilitating a robust comparative analysis across LLM variants.

The BLEU (Bilingual Evaluation Understudy) score is a precision-based metric originally developed for evaluating machine translation systems26. It measures the overlap of n-grams between the LLM response to the original prompt and the LLM response to the perturbed prompt, with higher scores indicating closer alignment with the original prompt-based response. While commonly used in translation, it has also been widely adopted for assessing the lexical similarity of text generated by language models27. The ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence) score, on the other hand, emphasizes recall by evaluating the longest common subsequence between a candidate and a reference text, making it particularly effective for summarization and other tasks where capturing content coverage is essential28. Finally, the Jaccard Index quantifies the similarity between two sets by dividing the size of their intersection by the size of their union. When applied to text, it typically compares sets of words, offering a straightforward measure of lexical overlap29. Together, these metrics provide a complementary evaluation suite that captures different aspects of similarity and divergence between generated and reference texts.
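To make the comparison concrete, the sketch below computes the three scores between a clean-prompt response r_o and the corresponding perturbed-prompt response r_p. The particular implementations chosen here (NLTK's sentence-level BLEU with smoothing and the rouge-score package) are assumptions; the text above does not prescribe specific libraries.

```python
# Hedged sketch of the three similarity scores between r_o and r_p.
# Library choices (NLTK BLEU, `rouge-score`) are assumptions, not a prescribed spec.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def bleu(r_o: str, r_p: str) -> float:
    # n-gram overlap of the perturbed-prompt response against the clean one
    return sentence_bleu([r_o.split()], r_p.split(),
                         smoothing_function=SmoothingFunction().method1)

def rouge_l_f1(r_o: str, r_p: str) -> float:
    # F1 over the longest common subsequence
    return _rouge.score(r_o, r_p)["rougeL"].fmeasure

def jaccard(r_o: str, r_p: str) -> float:
    # word-set intersection over union
    a, b = set(r_o.lower().split()), set(r_p.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0
```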

Results and discussion

Table 2 Performance metrics for different LLMs under keyboard and word replacement perturbation using the Amazon review dataset.
Table 3 Performance metrics for different LLMs under keyboard and word replacement perturbation using the IMDb movie review dataset.

The results presented in Table 2 compare the performance of the six large language models (LLMs), Gemma3-20b, Gemma3-4b, GPT-4o-12b, Llama3.2-4b, Llama3.3-70b, and Phi-3b, under two types of text perturbation, keyboard typos and word replacement, on the Amazon review dataset. Across the board, GPT-4o-12b consistently outperforms the other models, achieving the highest scores on all three metrics (BLEU, ROUGE-L, and Jaccard Index) under both perturbation types. For instance, under keyboard perturbation, it records a BLEU score of 0.466 ± 0.246, a ROUGE-L score of 0.691 ± 0.169, and a Jaccard Index of 0.555 ± 0.203. Even these best scores, however, indicate only limited robustness in maintaining semantic and lexical alignment with the response to the original text once keyboard or word replacement noise is introduced.

Fig. 2 The mean and standard deviation of the BLEU score, ROUGE-L F1 score, and Jaccard Index for the six LLMs' resilience against perturbations on the Amazon review dataset. The first row shows the results for the keyboard-based perturbation, and the second row shows the results for the word replacement-based perturbation.

In contrast, lighter models such as Phi-3b and Gemma3-4b exhibit weaker performance, with markedly lower scores across all metrics. This performance gap highlights the advantage of larger-scale models in handling perturbed inputs more effectively, likely due to their greater parameter capacity and broader training data exposure. Among the Llama models, Llama3.3-70b shows somewhat greater resilience, especially under word replacement perturbations, where it nearly rivals GPT-4o with a ROUGE-L score of 0.661 ± 0.183 and a Jaccard Index of 0.529 ± 0.213. Overall, the results suggest that model size and architectural advancements influence robustness against input noise, and GPT-4o leads in maintaining robustness under both forms of perturbation. Nevertheless, all scores remain low in absolute terms, showing that perturbations have a significant effect on LLM performance. More detailed results are presented in Table 2.

Table 3 presents a comparative analysis of the robustness of the six large language models (LLMs), Gemma3-20b, Gemma3-4b, GPT-4o-12b, Llama3.2-4b, Llama3.3-70b, and Phi-3b, under two types of input perturbation, keyboard-based noise and word replacement, evaluated on the IMDb movie review dataset. The evaluation metrics include the BLEU score, ROUGE-L F1 score, and Jaccard Index, each reflecting different aspects of text similarity and semantic preservation. Among the models, GPT-4o-12b, Llama3.2-4b, and Llama3.3-70b consistently achieved the strongest results across all metrics and perturbation types, indicating comparatively higher robustness and semantic fidelity under noisy input conditions. Notably, GPT-4o-12b achieved the highest Jaccard Index (0.570 ± 0.213) and ROUGE-L score (0.696 ± 0.175) under keyboard perturbation, while Llama3.3-70b achieved the highest BLEU score (0.541 ± 0.256), underscoring its capability to maintain both lexical and semantic integrity despite noise.

Fig. 3 The mean and standard deviation of the BLEU score, ROUGE-L F1 score, and Jaccard Index for the six LLMs' resilience against perturbations on the IMDb movie review dataset. The first row shows the results for the keyboard-based perturbation, and the second row shows the results for the word replacement-based perturbation.

Under word replacement perturbation, the top-performing models remained consistent, with Llama3.3-70b, GPT-4o-12b, and Llama3.2-4b again leading the group. GPT-4o-12b and Llama3.3-70b performed particularly well on ROUGE-L (0.675 ± 0.170 and 0.671 ± 0.187, respectively) and the Jaccard Index (0.544 ± 0.202 and 0.548 ± 0.216, respectively), indicative of their ability to preserve semantic meaning despite lexical changes. In contrast, smaller models like Gemma3-4b and Phi-3b showed relatively lower performance across all metrics, particularly on BLEU and the Jaccard Index, suggesting greater sensitivity to perturbations. Overall, these results emphasize the comparatively better resilience of larger and more advanced LLMs in handling noisy or altered inputs and their potential for real-world applications involving imperfect or adversarial text data. However, substantial room remains for improving LLMs' resilience to perturbed inputs.

In Fig. 2, we present a detailed visual comparison of the average performance (mean) and variability (standard deviation) of each large language model (LLM) on the Amazon review dataset (Dataset 1). The results indicate that, under the keyboard typo-based perturbation, the GPT-4o 12B model achieves the highest ROUGE-L F1 score, demonstrating superior robustness and consistency in handling such input noise. It is followed closely by LLaMA3.3 70B, which also shows strong performance, and then by the Gemma3 20B model. Interestingly, a comparable pattern is observed when the models are evaluated using word replacement-based perturbations. This suggests that the models' relative rankings remain stable across different types of semantic and lexical disturbance, reinforcing the reliability of GPT-4o 12B in maintaining high-quality output despite variations in input text.

In Fig. 3, we illustrate and compare the mean ROUGE-L F1 scores and standard deviations for the same set of LLMs, this time on the IMDb movie review dataset (Dataset 2). The results again highlight GPT-4o 12B as the top-performing model under keyboard typo perturbations, with LLaMA3.3 70B and LLaMA3.2 4B coming in second and third, respectively. For the word replacement perturbation on this dataset, the trend remains consistent: GPT-4o 12B continues to lead, followed by LLaMA3.3 70B and LLaMA3.2 4B. These consistent outcomes across both datasets and perturbation types emphasize the robustness and adaptability of GPT-4o 12B, making it a reliable choice for tasks involving noisy or altered input text.

One limitation of utilizing APIs to run large language models–such as the OpenAI API for GPT-4o or the Together.ai API for LLaMA3.3 70B–is the lack of control over the underlying hardware resources. Specifically, there is no guarantee that each prompt will be processed using the same machine configuration, which may introduce variability in the generated outputs, even when the temperature is fixed at zero. Additionally, these APIs may apply different quantization techniques behind the scenes, which can further impact the model's repeatability and robustness. Such inconsistencies have been highlighted in prior work30, emphasizing the challenges in ensuring reproducible and stable performance when relying on third-party API access for LLM inference.

As illustrated in Table 1, certain words lose their semantic integrity following perturbations, rendering them meaningless or difficult to interpret. This observation underscores the importance of examining how tokenization is handled by different models. To investigate this, we analyzed the tokenization behavior of both GPT-4o and LLaMA3.3 70B. For example, the original phrase “no issues” is tokenized identically by both models, producing a clean and semantically coherent split. However, when we introduce a typographical error, changing the phrase to “no isskes”, the tokenization behavior diverges. LLaMA3.3 70B breaks the perturbed input into smaller subword fragments, indicating a fragmented interpretation in which the model may struggle to reconstruct the intended meaning. GPT-4o, on the other hand, produces a slightly more cohesive segmentation, though it still deviates from the original word structure.
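The GPT-4o side of this comparison can be reproduced with a short tokenizer probe such as the one below, under the assumption that GPT-4o's vocabulary corresponds to tiktoken's o200k_base encoding; probing the LLaMA tokenizer would follow the same pattern with transformers.AutoTokenizer and is omitted here.

```python
# Tokenizer probe for the clean and perturbed phrases discussed above.
# Assumes GPT-4o's vocabulary matches tiktoken's "o200k_base" encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for phrase in ["no issues", "no isskes"]:
    pieces = [enc.decode([token]) for token in enc.encode(phrase)]
    print(phrase, "->", pieces)  # perturbed words typically split into more pieces
```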

Table 4 Examples of GPT 4o responses to original and perturbed prompts, text review, and the corresponding Jaccard Similarity index score. Words in bold show the sentiment predicted by GPT-4o.

These differences highlight a critical point: when models encounter unfamiliar or distorted words, their tokenizers tend to break them into smaller subword units. This mechanism allows the model to potentially focus on partial patterns or recognizable components of the input. Nonetheless, there is a trade-off–if the perturbed token alters the meaning significantly, the model might misinterpret or overlook crucial semantic content. This can lead to degradation in performance, particularly in tasks where word-level precision is essential.

As illustrated in Table 4, when the GPT-4o model is provided with keyboard-perturbed review texts as input, the generated responses, shown in the third column, often exhibit a shift in the interpreted sentiment, particularly when the Jaccard similarity index is low. Conversely, higher Jaccard similarity scores correspond to greater consistency between the responses to the original and perturbed inputs, with the model producing similar responses and maintaining the same sentiment interpretation. For instance, the first row shows a shift in sentiment from positive to neutral after the keyboard perturbation shown in the second column is applied.

The Textflint package was introduced in25 and has been used in14. However, we observed that the words Textflint selects in its synonym-replacement approach are not always exact synonyms. For instance, “loves” and “likes” are related; however, the two words are not synonyms. Therefore, our future research will study the effect of word replacement using exact synonyms.

Conclusion

In this study, we conducted a comprehensive evaluation of the robustness of six large language models (LLMs) when exposed to two categories of textual perturbation in user prompts: typographical errors and word replacement. Our experimental findings reveal that these perturbations can impact model performance, indicating that current LLMs exhibit limited resilience to input variations that commonly occur in real-world applications. Specifically, models produced responses that differed from their clean-prompt responses when presented with typo-induced distortions. Similarly, substituting words with related words also led to inconsistencies in model responses, suggesting a sensitivity to lexical variations that alter surface-level token structures.

Overall, our findings emphasize the importance of improving LLM robustness to enhance their reliability and generalizability in real-world scenarios where user inputs are rarely clean or consistent. Future work should focus on developing perturbation-invariant tokenization strategies and training regimes that explicitly account for noisy or altered inputs.