Introduction

In many existing studies and applications, it is often implicitly assumed that the inputs provided to large language models (LLMs) and natural language processing (NLP) pipelines are well-structured, unambiguous, and free of noise. This assumption may hold in controlled experimental environments; however, it does not accurately reflect the nature of real-world data, which is often noisy, incomplete, or distorted due to user errors, adversarial manipulation, or data acquisition issues1. As a result, LLMs and NLP systems that are designed and evaluated under idealized conditions may exhibit degraded performance or unpredictable behavior when deployed in more complex, dynamic, and noisy operational settings. This gap between training conditions and deployment environments highlights a critical challenge: ensuring the robustness of LLMs against various types of input perturbations. Investigating the resilience of NLP models to such perturbations is essential for building trustworthy and reliable AI systems. Without such consideration, these models may be prone to failures, biases, or misinterpretations when confronted with unexpected or corrupted inputs, potentially leading to harmful or unintended outcomes2.

In real-world applications, textual noise can arise from a wide range of sources, affecting the reliability and performance of large language models (LLMs). A significant portion of this noise originates from human factors, including typographical errors, misspellings, grammatical inconsistencies, and informal language usage. Such imperfections are common in user-generated content, such as social media posts, emails, or text messages. In addition to human-related noise, automated systems also contribute to textual degradation. For instance, machine-based pipelines like optical character recognition (OCR) and automatic speech recognition (ASR) are prone to introducing transcription errors, misidentified words, or incomplete text, especially under suboptimal conditions such as low image or audio quality3.

While modern LLMs are pre-trained on vast corpora of web-based text, the precise amount and nature of noise within these pre-training datasets remain unclear. More importantly, these models are often further refined through post-training or fine-tuning on highly curated and cleaned datasets, particularly for downstream tasks such as question answering or summarization. This discrepancy between noisy pre-training data and sanitized post-training data may limit the models’ robustness, leading to reduced generalizability and performance degradation when they are deployed in real-life scenarios involving noisy inputs. Furthermore, semantic variations in input–such as the substitution of synonyms or paraphrased expressions–can significantly alter the model’s interpretation and output, particularly if the model lacks sufficient exposure to such lexical diversity during training3.

Traditional NLP pipelines typically incorporate explicit data preprocessing steps aimed at mitigating noise, including techniques such as language error correction (LEC) and spelling normalization3. Although some advances, such as subword-level and character-level embeddings, have shown improved resilience to certain types of textual noise, recent studies have demonstrated that LLMs remain vulnerable to word-level perturbations. These perturbations can cause notable shifts in the model’s output, highlighting a persistent robustness challenge even in state-of-the-art models4,5.

Textual perturbations–such as replacing a word with its synonym, introducing typographical errors, or making small structural edits–are common in user-generated inputs provided to large language models (LLMs)5. These modifications, while often unintentional, can have a substantial effect on the performance and reliability of LLMs. A key reason for this sensitivity is that LLMs are typically trained on large-scale, clean, and well-curated corpora that do not fully reflect the imperfections and inconsistencies found in real-world text. As a result, even minor deviations–such as a single character substitution, misspelling, or word replacement–can cause unexpected outputs or degraded model performance. Perturbations vary in complexity, ranging from single-character noise to more nuanced lexical changes, such as paraphrasing or synonym substitution. While some models have demonstrated partial robustness to specific types of noise, many LLMs still exhibit significant performance drops when the input deviates from their expected distribution, particularly in tasks requiring grammatical integrity, semantic precision, or context-sensitive reasoning. These limitations underscore the need for enhanced robustness mechanisms that allow LLMs to maintain performance even when faced with textual variability.

LLMs have achieved impressive capabilities across a broad range of natural language tasks, including text summarization, question answering, content generation, and reasoning6,7,8,9,10. These advancements are largely attributed to the auto-regressive training of LLMs on massive corpora that span diverse topics and styles. Despite the inclusion of both clean and perturbed data in some training pipelines to enhance generalization, empirical evaluations have shown that LLMs remain vulnerable to input noise, particularly in downstream tasks such as question answering (Q&A) and classification. While several studies have explored the robustness of LLMs in various scenarios, to the best of our knowledge, no prior work has systematically assessed the impact of text perturbations on LLMs in the context of the text generation task, where the input variations may lead to changes in both understanding and generated responses.

In this study, we present a comprehensive evaluation of LLM robustness against two specific forms of perturbations: keyboard-based typographical errors and word replacement. These perturbations were synthetically introduced into two benchmark datasets to generate perturbed counterparts of the original clean datasets. We selected six LLMs from both open-source and proprietary sources and evaluated their performance across clean and perturbed versions of the datasets. Our aim is to quantify the degradation in model accuracy due to text perturbations and to analyze the degree to which different models are resilient to such variations. The remainder of the paper is organized as follows: the next section reviews relevant background and related work; this is followed by a detailed explanation of our evaluation methodology; finally, we present our experimental results and offer a critical discussion of the findings.

Background

In this section, we highlight existing approaches for assessing the robustness of LLMs to perturbations in prompts. Wang et al. assessed LLM robustness to perturbation using a subset of the Natural Questions dataset11 (1k questions)5. The perturbations used by the authors comprise three levels of word perturbation: misspelling (level 1), swapping words (level 2), and replacing words with their synonyms (level 3). The authors used the Beaver-7B Reward Model12 and its associated cost model to evaluate the similarity between the LLM's responses to a clean prompt and to a perturbed prompt. They concluded that the evaluated LLMs were vulnerable to word-level perturbation.

Another work by Rauba et al. proposed a framework for assessing the robustness of LLMs to perturbations called Distribution-Based Perturbation Analysis (DBPA)13. This framework measures robustness using the distribution of the LLM's outputs. The approach was assessed under perturbations of the input prompt as well as perturbations of the LLM itself through fine-tuning (system perturbation). The authors observed variation in the responses when applying prompt perturbations.

Alvarado et al. investigated the resilience of large language models (LLMs) to various text perturbations, such as typos, synonym substitutions, and structural changes, across multiple natural language processing (NLP) tasks14. The authors analyzed the performance of four BERT-based models (DistilBERT, ELECTRA, XLNet, and Funnel Transformer) on tasks including grammatical coherence, sentiment analysis, and hate speech detection. They categorized perturbations into character-level, word-level, and miscellaneous changes and evaluated model robustness using Cohen's kappa coefficient. The findings revealed that the models exhibit varying degrees of vulnerability, with character-level perturbations having the most disruptive impact. Among the models tested, XLNet demonstrated the highest resilience, particularly in tasks requiring complex linguistic reasoning. The study highlights the necessity for models to adapt to real-world text irregularities to maintain reliable performance.

An enhancement of LLM robustness to perturbed instructions was proposed by Agrawal et al., who experimented with and compared different approaches such as iterative self-denoising, perplexity smoothing, instruction sampling, and representation alignment. The authors found that self-denoising enhances the robustness of the LLM against prompt perturbations15. They experimented with two LLMs, LLaMA-3 and Flan-T5, using three classification datasets from the GLUE benchmark16: CoLA17, QNLI18, and SST-219. Another approach assesses the robustness of LLMs against text prompt perturbations using cosine similarity as a measure of the semantic agreement between the LLM's responses to noisy and clean prompts3. This approach was evaluated on three datasets using three LLMs: GPT-4, LLaMA-3 7B, and BERT.

The DBPA framework of Rauba et al.13 additionally uses Monte Carlo sampling to construct a low-dimensional semantic similarity space for quantifying the impact of perturbations on LLMs. The NLPerturbed framework was proposed by Chen et al.20 for evaluating 18 different categories of perturbation applied to a given set of prompts. That study focuses on the robustness of code generation against prompt perturbations, and the authors showed that perturbed prompts decrease code generation performance.

Another approach that exposes the fragility of LLM safeguards against perturbation was proposed in21. It shows that appending a space or a string of characters to the end of the original prompt can override the trained behavior of refusing to reply to unsafe prompts, which underscores the importance of addressing the robustness of LLMs against perturbation. Chaudhary et al.22 evaluated the robustness of Google Gemini 1 against perturbations and compared its performance with human performance in scoring text datasets. This work used two datasets, SummEval23 and USR24, and showed that Google Gemini 1 falls short when perturbations are injected into the prompts.

Fig. 1 The pipeline of our approach for assessing the robustness of LLMs against perturbations. Dotted arrows denote perturbed text prompts and the resulting responses, whereas solid arrows denote clean text prompts and the resulting LLM responses.

Experiment

Dataset

In this experiment, we utilized two publicly accessible datasets to evaluate sentiment analysis models. The first dataset is the Amazon Reviews dataset, which can be obtained from the HuggingFace platform. This dataset comprises a collection of user-generated textual reviews for a wide range of Amazon products. It contains a total of 4,920 individual entries, each representing a single review written by a customer. These reviews encapsulate subjective feedback, opinions, and sentiments regarding the quality and performance of various products, making the dataset suitable for sentiment classification tasks.

The second dataset employed in our study is a compact version of the IMDb movie reviews dataset, also available through HuggingFace. This dataset includes 3,000 movie reviews authored by users, each comprising a single entry of free-text feedback about a particular film. These reviews often express nuanced emotional reactions and personal evaluations, offering rich contextual information for training and evaluating sentiment analysis models. The linguistic diversity and varying intensity of sentiment present in this dataset make it particularly valuable for developing models capable of understanding and interpreting natural language sentiment.

Large language models

In this study, we used six large language models (LLMs): Gemma3 20B, Gemma3 4B, GPT-4o 12B, Llama3.2 4B, Llama3.3 70B, and Phi 3B. All models were run locally through the Ollama software package, except GPT-4o and Llama 3.3 70B, for which we used the OpenAI API and the Together.ai API, respectively. This selection spans a range of architectures and parameter sizes, enabling a comprehensive analysis of how model scale and design affect resilience to input noise. Each LLM was prompted with a single text at a time (as detailed in the experiment below), with the temperature parameter set to zero to ensure that the same input prompt yields the same response and to eliminate stochastic sampling of output tokens.
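For illustration, the sketch below shows how a hosted model and a locally served model might be queried deterministically with the temperature fixed at zero. It assumes the openai and ollama Python packages; the model identifiers and prompt wiring are illustrative placeholders rather than our exact configuration.

```python
# Illustrative sketch (not our exact setup): deterministic querying of one hosted
# and one locally served model. Assumes the `openai` and `ollama` Python packages.
from openai import OpenAI
import ollama

PROMPT = "Write two sentences at maximum describing the sentiment in the following text: {review}"

def query_gpt4o(review: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(review=review)}],
        temperature=0,  # suppress stochastic sampling of output tokens
    )
    return resp.choices[0].message.content

def query_local(review: str, model: str = "llama3.2") -> str:
    resp = ollama.chat(
        model=model,  # any model pulled into the local Ollama server
        messages=[{"role": "user", "content": PROMPT.format(review=review)}],
        options={"temperature": 0},
    )
    return resp["message"]["content"]
```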

Method

To simulate realistic input variations, we introduced two types of perturbations: keyboard typos and word replacement. These were applied using the Textflint library, following the methodology described in Romero et al. (2024)14 and Wang et al. (2021)25. The keyboard typo perturbation randomly replaces characters in words with those on adjacent keys, often producing malformed or misspelled tokens that can obscure or distort meaning. In contrast, word replacement largely preserves semantic intent while varying surface form by replacing words with lexical equivalents, reflecting the variability commonly seen in user-generated content. The Textflint package uses the WordNet database to find synonyms of words.
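Since the exact Textflint calls are not reproduced here, the following simplified sketch illustrates the two perturbation types themselves: keyboard-adjacency typos and WordNet-based word replacement via NLTK. The adjacency map, perturbation rate, and replacement policy are assumptions made for this example and only approximate what the library does.

```python
# Simplified illustration of the two perturbation types (not Textflint's API).
# The adjacency map, perturbation rate, and replacement policy are assumptions.
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

ADJACENT = {  # partial QWERTY adjacency map, for illustration only
    "a": "qwsz", "e": "wrsd", "i": "ujko", "o": "iklp",
    "s": "awedxz", "n": "bhjm", "t": "rfgy", "u": "yhji",
}

def keyboard_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Replace a fraction of characters with a neighboring key."""
    rng = random.Random(seed)
    chars = list(text)
    for idx, ch in enumerate(chars):
        if ch.lower() in ADJACENT and rng.random() < rate:
            chars[idx] = rng.choice(ADJACENT[ch.lower()])
    return "".join(chars)

def word_replacement(text: str, seed: int = 0) -> str:
    """Replace each word with a WordNet lemma from its synsets, when one exists."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        lemmas = {l.name().replace("_", " ")
                  for syn in wordnet.synsets(word) for l in syn.lemmas()}
        lemmas.discard(word)
        out.append(rng.choice(sorted(lemmas)) if lemmas else word)
    return " ".join(out)

print(keyboard_typos("no issues with this product"))
print(word_replacement("loves this movie"))
```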

All models were evaluated using a consistent sentiment analysis prompt template: “Write two sentences at maximum describing the sentiment in the following text: [Review].” Here, [Review] was replaced by each review sample in our datasets. Table 1 shows representative examples of both original and perturbed reviews. This setup allows us to measure how well the models generalize across both semantics-preserving and semantics-altering changes, revealing insights into their practical robustness under noisy or variable input conditions.

To evaluate the resilience of an LLM denoted by f, we iterated over each original sample \(x_o\) in the datasets \(D_1\) and \(D_2\) and applied the perturbation step (i.e., keyboard typos or word replacement) to obtain the perturbed text \(x_p\). The response to the original sample is then obtained as \(r_o = f(x_o)\), where \(r_o\) denotes the response when the LLM f is prompted with the original text sample \(x_o\). Similarly, for the perturbed text \(x_p\), we prompt the LLM f such that \(r_p = f(x_p)\), where \(r_p\) is the response of the LLM when prompted with the perturbed text \(x_p\). In other words, model outputs were compared between original and perturbed inputs to quantify resilience to both semantics-preserving and semantics-altering variations. This experimental design enables a controlled evaluation of large language models’ ability to generalize under realistic, noisy input conditions. In Fig. 1, we show the pipeline of our approach for evaluating the resilience of text generation LLMs against text perturbations.
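A minimal sketch of this evaluation loop is given below. The query_llm and perturb arguments stand in for any of the six models and either perturbation type; they are placeholders for illustration, not our exact implementation.

```python
# Minimal sketch of the evaluation loop: collect (r_o, r_p) response pairs.
# `query_llm` and `perturb` are placeholders (e.g., the helpers sketched above).
PROMPT = "Write two sentences at maximum describing the sentiment in the following text: {review}"

def evaluate_model(query_llm, perturb, reviews):
    pairs = []
    for x_o in reviews:
        x_p = perturb(x_o)                          # perturbed counterpart of x_o
        r_o = query_llm(PROMPT.format(review=x_o))  # r_o = f(x_o)
        r_p = query_llm(PROMPT.format(review=x_p))  # r_p = f(x_p)
        pairs.append((r_o, r_p))
    return pairs

# Example wiring: pairs = evaluate_model(query_gpt4o, keyboard_typos, amazon_reviews)
```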

Table 1 Examples of review texts in the datasets and the perturbed versions produced by the keyboard and word replacement approaches using the Textflint package.

Evaluation

To systematically assess the robustness of the six large language models (LLMs) when exposed to perturbed text prompts, we employed a comprehensive evaluation framework utilizing three widely adopted performance metrics: BLEU score, ROUGE-L F1 score, and Jaccard Index. These metrics collectively capture different dimensions of output quality, including n-gram overlap, recall-oriented summarization accuracy, and set-based similarity, respectively. This multi-metric approach enables a more nuanced understanding of each model’s sensitivity to textual perturbations, facilitating a robust comparative analysis across LLM variants.

The BLEU (Bilingual Evaluation Understudy) score is a precision-based metric originally developed for evaluating machine translation systems26. It measures the overlap of n-grams between the LLM response to the original prompt and the LLM response to the perturbed prompt, with higher scores indicating closer alignment with the original prompt-based response. While commonly used in translation, it has also been widely adopted for assessing the lexical similarity of text generated by language models27. The ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence) score, on the other hand, emphasizes recall by evaluating the longest common subsequence between a candidate and a reference text, making it particularly effective for summarization and other tasks where capturing content coverage is essential28. Finally, the Jaccard Index quantifies the similarity between two sets by dividing the size of their intersection by the size of their union. When applied to text, it typically compares sets of words, offering a straightforward measure of lexical overlap29. Together, these metrics provide a complementary evaluation suite that captures different aspects of similarity and divergence between generated and reference texts.
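To make the comparison concrete, the sketch below computes the three scores between a clean-prompt response r_o and the corresponding perturbed-prompt response r_p. The particular implementations chosen here (NLTK's sentence-level BLEU with smoothing and the rouge-score package) are assumptions; the text above does not prescribe specific libraries.

```python
# Hedged sketch of the three similarity scores between r_o and r_p.
# Library choices (NLTK BLEU, `rouge-score`) are assumptions, not a prescribed spec.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def bleu(r_o: str, r_p: str) -> float:
    # n-gram overlap of the perturbed-prompt response against the clean one
    return sentence_bleu([r_o.split()], r_p.split(),
                         smoothing_function=SmoothingFunction().method1)

def rouge_l_f1(r_o: str, r_p: str) -> float:
    # F1 over the longest common subsequence
    return _rouge.score(r_o, r_p)["rougeL"].fmeasure

def jaccard(r_o: str, r_p: str) -> float:
    # word-set intersection over union
    a, b = set(r_o.lower().split()), set(r_p.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0
```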

Results and discussion

Table 2 Performance metrics for different LLMs under keyboard and word replacement perturbation using the Amazon review dataset.
Table 3 Performance metrics for different LLMs under keyboard and word replacement perturbation using the IMDb movie review dataset.

The results presented in Table 2 compare the performance of the six large language models (LLMs), Gemma3-20b, Gemma3-4b, GPT-4o-12b, Llama3.2-4b, Llama3.3-70b, and Phi-3b, under two types of text perturbation, keyboard typos and word replacement, on the Amazon review dataset. Across the board, GPT-4o-12b consistently outperforms the other models, achieving the highest scores on all three metrics (BLEU, ROUGE-L, and Jaccard Index) under both perturbation types. For instance, under keyboard perturbation, it records a BLEU score of 0.466 ± 0.246, a ROUGE-L score of 0.691 ± 0.169, and a Jaccard Index of 0.555 ± 0.203. Even these best scores, however, indicate only limited robustness in maintaining semantic and lexical alignment with the response to the original text once keyboard or word replacement noise is introduced.

Fig. 2 The mean and standard deviation of the BLEU score, ROUGE-L F1 score, and Jaccard Index for the six LLMs' resilience against perturbations on the Amazon review dataset. The first row shows the results for the keyboard-based perturbation, and the second row shows the results for the word replacement-based perturbation.

In contrast, lighter models such as Phi-3b and Gemma3-4b exhibit weaker performance, with markedly lower scores across all metrics. This performance gap highlights the advantage of larger-scale models in handling perturbed inputs more effectively, likely due to their greater parameter capacity and broader training data exposure. Among the Llama models, Llama3.3-70b shows somewhat greater resilience, especially under word replacement perturbations, where it nearly rivals GPT-4o with a ROUGE-L score of 0.661 ± 0.183 and a Jaccard Index of 0.529 ± 0.213. Overall, the results suggest that model size and architectural advancements influence robustness against input noise, and GPT-4o leads in maintaining robustness under both forms of perturbation. Nevertheless, all scores remain low in absolute terms, showing that perturbations have a significant effect on LLM performance. More detailed results are presented in Table 2.

Table 3 presents a comparative analysis of the robustness of the six large language models (LLMs), Gemma3-20b, Gemma3-4b, GPT-4o-12b, Llama3.2-4b, Llama3.3-70b, and Phi-3b, under two types of input perturbation, keyboard-based noise and word replacement, evaluated on the IMDb movie review dataset. The evaluation metrics include the BLEU score, ROUGE-L F1 score, and Jaccard Index, each reflecting different aspects of text similarity and semantic preservation. Among the models, GPT-4o-12b, Llama3.2-4b, and Llama3.3-70b consistently achieved the strongest results across all metrics and perturbation types, indicating comparatively higher robustness and semantic fidelity under noisy input conditions. Notably, GPT-4o-12b achieved the highest Jaccard Index (0.570 ± 0.213) and ROUGE-L score (0.696 ± 0.175) under keyboard perturbation, while Llama3.3-70b achieved the highest BLEU score (0.541 ± 0.256), underscoring its capability to maintain both lexical and semantic integrity despite noise.

Fig. 3 The mean and standard deviation of the BLEU score, ROUGE-L F1 score, and Jaccard Index for the six LLMs' resilience against perturbations on the IMDb movie review dataset. The first row shows the results for the keyboard-based perturbation, and the second row shows the results for the word replacement-based perturbation.

Under word replacement perturbation, the top-performing models remained consistent, with Llama3.3-70b, GPT-4o-12b, and Llama3.2-4b again leading the group. GPT-4o-12b and Llama3.3-70b performed particularly well on ROUGE-L (0.675 ± 0.170 and 0.671 ± 0.187, respectively) and the Jaccard Index (0.544 ± 0.202 and 0.548 ± 0.216, respectively), indicative of their ability to preserve semantic meaning despite lexical changes. In contrast, smaller models like Gemma3-4b and Phi-3b showed relatively lower performance across all metrics, particularly on BLEU and the Jaccard Index, suggesting greater sensitivity to perturbations. Overall, these results emphasize the comparatively better resilience of larger and more advanced LLMs in handling noisy or altered inputs and their potential for real-world applications involving imperfect or adversarial text data. However, substantial room remains for improving LLMs' resilience to perturbed inputs.

In Fig. 2, we present a detailed visual comparison of the average performance (mean) and variability (standard deviation) of each large language model (LLM) on the Amazon review dataset (Dataset 1). The results indicate that, under the keyboard typo-based perturbation, the GPT-4o 12B model achieves the highest ROUGE-L F1 score, demonstrating superior robustness and consistency in handling such input noise. It is followed closely by LLaMA3.3 70B, which also shows strong performance, and then by the Gemma3 20B model. Interestingly, a comparable pattern is observed when the models are evaluated using word replacement-based perturbations. This suggests that the models' relative rankings remain stable across different types of semantic and lexical disturbance, reinforcing the reliability of GPT-4o 12B in maintaining high-quality output despite variations in input text.

In Fig. 3, we illustrate and compare the mean ROUGE-L F1 scores and standard deviations for the same set of LLMs, this time on the IMDb movie review dataset (Dataset 2). The results again highlight GPT-4o 12B as the top-performing model under keyboard typo perturbations, with LLaMA3.3 70B and LLaMA3.2 4B coming in second and third, respectively. For the word replacement perturbation on this dataset, the trend remains consistent: GPT-4o 12B continues to lead, followed by LLaMA3.3 70B and LLaMA3.2 4B. These consistent outcomes across both datasets and perturbation types emphasize the robustness and adaptability of GPT-4o 12B, making it a reliable choice for tasks involving noisy or altered input text.

One limitation of utilizing APIs to run large language models–such as the OpenAI API for GPT-4o or the Together.ai API for LLaMA3.3 70B–is the lack of control over the underlying hardware resources. Specifically, there is no guarantee that each prompt will be processed using the same machine configuration, which may introduce variability in the generated outputs, even when the temperature is fixed at zero. Additionally, these APIs may apply different quantization techniques behind the scenes, which can further impact the model's repeatability and robustness. Such inconsistencies have been highlighted in prior work30, emphasizing the challenges in ensuring reproducible and stable performance when relying on third-party API access for LLM inference.

As illustrated in Table 1, certain words lose their semantic integrity following perturbations, rendering them meaningless or difficult to interpret. This observation underscores the importance of examining how tokenization is handled by different models. To investigate this, we analyzed the tokenization behavior of both GPT-4o and LLaMA3.3 70B. For example, the original phrase “no issues” is tokenized identically by both models, producing a clean and semantically coherent split. However, when we introduce a typographical error, changing the phrase to “no isskes”, the tokenization behavior diverges. LLaMA3.3 70B breaks the perturbed input into smaller subword fragments, indicating a fragmented interpretation in which the model may struggle to reconstruct the intended meaning. GPT-4o, on the other hand, produces a slightly more cohesive segmentation, though it still deviates from the original word structure.
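The GPT-4o side of this comparison can be reproduced with a short tokenizer probe such as the one below, under the assumption that GPT-4o's vocabulary corresponds to tiktoken's o200k_base encoding; probing the LLaMA tokenizer would follow the same pattern with transformers.AutoTokenizer and is omitted here.

```python
# Tokenizer probe for the clean and perturbed phrases discussed above.
# Assumes GPT-4o's vocabulary matches tiktoken's "o200k_base" encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for phrase in ["no issues", "no isskes"]:
    pieces = [enc.decode([token]) for token in enc.encode(phrase)]
    print(phrase, "->", pieces)  # perturbed words typically split into more pieces
```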

Table 4 Examples of GPT 4o responses to original and perturbed prompts, text review, and the corresponding Jaccard Similarity index score. Words in bold show the sentiment predicted by GPT-4o.

These differences highlight a critical point: when models encounter unfamiliar or distorted words, their tokenizers tend to break them into smaller subword units. This mechanism allows the model to potentially focus on partial patterns or recognizable components of the input. Nonetheless, there is a trade-off–if the perturbed token alters the meaning significantly, the model might misinterpret or overlook crucial semantic content. This can lead to degradation in performance, particularly in tasks where word-level precision is essential.

As illustrated in Table 4, when the GPT-4o model is provided with keyboard-perturbed review texts as input, the generated responses, shown in the third column, often exhibit a shift in the interpreted sentiment, particularly when the Jaccard similarity index is low. Conversely, higher Jaccard similarity scores correspond to greater consistency between the responses to the original and perturbed inputs, with the model producing similar responses and maintaining the same sentiment interpretation. For instance, the first row shows a shift in sentiment from positive to neutral after the keyboard perturbation shown in the second column is applied.

The Textflint package was introduced in25 and has been used in14. However, we observed that the words Textflint selects in its synonym-replacement approach are not always exact synonyms. For instance, “loves” and “likes” are related; however, the two words are not synonyms. Therefore, our future research will study the effect of word replacement using exact synonyms.

Conclusion

In this study, we conducted a comprehensive evaluation of the robustness of six large language models (LLMs) when exposed to two categories of textual perturbation in user prompts: typographical errors and word replacement. Our experimental findings reveal that these perturbations can impact model performance, indicating that current LLMs exhibit limited resilience to input variations that commonly occur in real-world applications. Specifically, models produced responses that differed from their clean-prompt responses when presented with typo-induced distortions. Similarly, substituting words with related words also led to inconsistencies in model responses, suggesting a sensitivity to lexical variations that alter surface-level token structures.

Overall, our findings emphasize the importance of improving LLM robustness to enhance their reliability and generalizability in real-world scenarios where user inputs are rarely clean or consistent. Future work should focus on developing perturbation-invariant tokenization strategies and training regimes that explicitly account for noisy or altered inputs.