Typos Barely Hurt LLMs. Clear Prompts Matter More.

I ran a benchmark on noisy prompts in two common workflows: context-grounded question answering (QA) and code editing. The main result is simple: ordinary typos had little measurable effect, clear prompts improved QA accuracy, and clean context preserved accuracy at higher noise levels.

Open trivia with no context: all three models collapse at 30% uniform noise.
Context-grounded QA with a clean document supplied: the cliff shifts, with performance holding through 30% and only collapsing at 50%.

Main result

  • Prompt clarity mattered more than spelling mistakes in this benchmark.
  • Clean context preserved accuracy one noise tier longer.
  • Random character corruption created a sharp failure point around 30%.

One concrete example


The benchmark included a task about GitHub webhook redelivery. The model received a clean docs excerpt and a short request asking what happens after a failed delivery if nobody clicks Redeliver.

Here is the pattern from that task family in one view:

Version of the request                       Typical outcome
Clean request + clean context                Correct
Human typo version + clean context           Usually still correct
30% random-noise version + no context        Often wrong
30% random-noise version + clean context     Much better than the no-context version

That pattern captures most of the benchmark. The request variants themselves looked like this:

Clean:  What automatic redelivery behavior should the operator expect from GitHub?
Typo:   What automatic redelivrey behavior should the opeartor expect from GitHub?
Noise:  Whzt auzomrfic redelifsry brhxvior spuld the onhratxr exphxt from GitXub?

Scope


This benchmark studies one common setup: the model receives the relevant document or code snippet, and the user request contains noise.

Out of scope:

  • full agents
  • codebase navigation
  • open-ended chat
  • from-scratch code generation

What I tested


I used 45 context-grounded question answering (QA) tasks and 24 code-editing tasks across five noise levels. The model always received clean context. Only the request changed.

Examples of the two task types:

  • Question answering (QA): read a docs excerpt and answer a precise factual question. Example: "If a GitHub webhook delivery fails and nobody clicks Redeliver, what automatic retry behavior should the operator expect?"
  • Code editing: modify an existing Python function to satisfy a small change request and fixed tests. Example: "Update a parser so repeated query parameters preserve all values instead of overwriting the last one."

I used two noise families:

  1. Natural typos: swaps, drops, insertions, and keyboard-neighbor mistakes.
  2. Uniform random noise: each character is independently replaced at a fixed rate.

Earlier work treats natural and synthetic noise as distinct problems because they stress models in different ways (Belinkov and Bisk, 2018).
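As a minimal sketch of how the two families differ (hypothetical helper names and a deliberately partial keyboard map; the benchmark's actual corruptors may differ):

```python
import random

# Partial keyboard-neighbor map, for illustration only.
KEYBOARD_NEIGHBORS = {"a": "qsz", "e": "wrd", "o": "ipl", "t": "ryg"}

def natural_typos(text: str, n_edits: int, rng: random.Random) -> str:
    """Apply human-style edits: swaps, drops, insertions, neighbor-key hits."""
    chars = list(text)
    for _ in range(n_edits):
        i = rng.randrange(len(chars))
        op = rng.choice(["swap", "drop", "insert", "neighbor"])
        if op == "swap":
            if i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "drop":
            del chars[i]
        elif op == "insert":
            chars.insert(i, chars[i])  # duplicate the character
        elif chars[i] in KEYBOARD_NEIGHBORS:
            chars[i] = rng.choice(KEYBOARD_NEIGHBORS[chars[i]])
    return "".join(chars)

def uniform_noise(text: str, rate: float, rng: random.Random) -> str:
    """Independently replace each character with probability `rate`."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(rng.choice(alphabet) if rng.random() < rate else c for c in text)

rng = random.Random(0)
print(uniform_noise("what automatic redelivery behavior", 0.3, rng))
```

The key difference is that natural typos leave most characters in place, so word-level structure largely survives, while uniform replacement degrades every position independently.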

Why this sample is enough for the claims in this post

These task counts are enough to support the benchmark-level claims in this post, for three reasons.

  1. Paired design. Each base task appears in clean and corrupted forms, so each task acts as its own control.
  2. Objective scoring. QA uses exact-match or locked alias scoring. Code uses deterministic unit tests.
  3. Clear effect sizes. The main QA prompt-style result clears a two-tailed sign test on the 45-task slice: verbose beats concise on 8 tasks, loses on 1, and yields p = 0.039. The uniform-noise cliff is much larger than that effect and appears across task families and model tiers.
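The sign test in point 3 is easy to recompute by hand; this sketch reproduces the p-value from the 8-wins, 1-loss split (ties are dropped, as is standard for a sign test):

```python
from math import comb

def sign_test_two_tailed(wins: int, losses: int) -> float:
    """Exact two-tailed sign test; ties are excluded, so n = wins + losses."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this extreme under a fair coin, both tails.
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(p, 1.0)

print(round(sign_test_two_tailed(8, 1), 3))  # 0.039 for the QA prompt-style result
```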

The code prompt-style result remains weaker after prompt cleanup, so it stays in a secondary role. These numbers support narrow, benchmark-level claims about this setup and design only.

Result 1: Ordinary typos had little measurable effect


Finding: ordinary typos had little measurable effect in this setup.

On QA, I pushed typo corruption up to 32 edits on a roughly 100-word request. Accuracy went from 84% on clean prompts to 69% at the most extreme typo budget. That drop yielded p = 0.078, which falls short of the conventional 0.05 threshold on this sample.

On code, GPT-5.4-mini stayed at 96% across all typo tiers. GPT-5.4-nano drifted to 75% at the extreme tier, and that pattern remains suggestive rather than firm on this slice.

Context-grounded QA: accuracy stays essentially flat across human-realistic typo tiers (p = 0.078, not significant).
Code editing: Mini stays flat across all typo tiers; Nano drifts slightly at the extreme tier, but the pattern is not significant.

This result fits two parts of the literature. Small spelling errors often preserve enough local structure for subword tokenization to recover useful signal (Sennrich, Haddow, and Birch, 2016). Targeted adversarial misspellings are a different case and can cause much larger drops (Pruthi, Dhingra, and Lipton, 2019).

Result 2: Random character corruption caused a sharp cliff


Finding: uniform random noise broke the request once the text stopped looking like language.

This noise family is a stress test for legibility loss. It shows the failure boundary clearly.

On open trivia, accuracy went from 64% at 15% noise to 20% at 30%, then 4% at 50%. Code editing showed the same cliff. Model tier changed the floor. The cliff itself appeared at roughly the same location across model tiers.

That split between natural typos and synthetic corruption matches the general distinction reported by Belinkov and Bisk (2018).

Result 3: Clean context preserved accuracy at a higher noise level


Finding: clean context preserved useful performance at 30% noise.

Context-grounded QA scored 69% at 30% noise. The same question family without a context block scored 20% at the same noise level.

With context, performance stayed usable through 30% noise and then collapsed at 50%. In practical terms, a clean spec, docs excerpt, or code block gave the model enough intact structure to recover the task more often.

That pattern is close to what dense retrieval work finds with degraded queries and a clean index: the retrieval target stays recoverable while enough anchor text survives in the query (Karpukhin et al., 2020).
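One way to make that anchoring intuition concrete (an illustration only, with a made-up context sentence; this is not the benchmark's scoring code): even the heavily corrupted request still shares far more character trigrams with the clean passage than an unrelated string does.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """All character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap(query: str, passage: str) -> float:
    """Fraction of the query's character trigrams that also appear in the passage."""
    grams = char_ngrams(query)
    return len(grams & char_ngrams(passage)) / len(grams)

clean = "What automatic redelivery behavior should the operator expect from GitHub?"
noisy = "Whzt auzomrfic redelifsry brhxvior spuld the onhratxr exphxt from GitXub?"
# Illustrative context sentence, not quoted from GitHub's documentation.
context = "If a webhook delivery fails, GitHub does not automatically redeliver it; the operator must click Redeliver."

print(f"noisy vs context:   {overlap(noisy, context):.2f}")
print(f"noisy vs unrelated: {overlap(noisy, 'unrelated text about cooking pasta'):.2f}")
```

Enough intact fragments ("the", "from", "redeli...") survive in the noisy request to point at the right span of the clean passage.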

Result 4: Structure mattered in the failure zone


Finding: at very high noise, residual structure determined how much accuracy remained.

At 50% noise, open recall was close to zero. Answer choices, code context, and text context all raised the floor.

Structure available     GPT-5.4-nano   GPT-5.4-mini   GPT-5.2
None (open trivia)      0%             4%             4%
Answer choices (MCQ)    8%             8%             58%
Code context            29%            25%            42%
Text context (QA)       44%            42%            58%
At 50% uniform noise, residual structure type and model tier jointly determine accuracy.

This result also fits prior reasoning work. When the prompt contains a small hypothesis set, such as answer choices, stronger models can score alternatives against partial evidence more effectively (Wei et al., 2022).

Result 5: Clear prompts improved QA accuracy more than typos did


Finding: on QA, prompt clarity had a larger effect than typo noise.

I wrote verbose and concise versions of the same 45 QA tasks. On clean prompts, the verbose version scored 91.1% and the concise version scored 75.6%. The gap was 15.5 percentage points (p = 0.039).

Under light uniform noise, the gap was still 11.1 points.

Condition             Verbose   Concise   Gap
Clean                 91.1%     75.6%     15.5 pp
Light uniform noise   82.2%     71.1%     11.1 pp
QA: verbose prompts outperform concise by 15.5 pp on clean prompts and 11.1 pp under light noise. Significant at p = 0.039.
Code: the prompt-style gap disappears after prompt cleanup. Not significant (p = 0.219).

The direct reading is simple: when the model must retrieve a precise fact from context, a prompt that states the scope and required output shape explicitly produces better results.

Result 6: Code editing followed the same noise cliff


Finding: code editing showed the same main noise boundary.

On uniform random noise, GPT-5.4-mini scored 96% / 92% / 63% / 25% at 5% / 15% / 30% / 50% noise. GPT-5.4-nano degraded earlier, but both models reached the main cliff at 30%.

Code editing (GPT-5.4-mini and nano) hits the same 30% noise cliff as trivia: stable while the request is readable, then sharply worse once the request loses word-level structure.

The code prompt-style result is weaker than the QA result after prompt cleanup. It serves as supporting evidence rather than the headline result.

Why this probably happens


The mechanism looks straightforward and the literature points in the same direction.

  1. Ordinary typos preserve partial structure. Subword tokenization keeps some useful local information when spelling errors are small (Sennrich, Haddow, and Birch, 2016).
  2. Synthetic corruption destroys anchors. Natural and synthetic noise behave differently in prior work, and they behaved differently here as well (Belinkov and Bisk, 2018).
  3. Clean context adds an intact target. The model can match degraded fragments in the request against a clean passage or code block, which is close to the retrieval pattern reported by Karpukhin et al. (2020).
  4. Targeted misspellings are harder than untargeted typos. That distinction helps explain why typo-only damage stayed small here while adversarial misspelling papers report stronger effects (Pruthi, Dhingra, and Lipton, 2019).
  5. Structure helps reasoning under damage. Answer choices and other explicit structure give stronger models a constrained set of alternatives to compare, which aligns with the general reasoning gains reported by Wei et al. (2022).
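Point 1 can be made concrete without a real tokenizer by measuring how far each corrupted word drifts from its clean counterpart: human-style typos stay within an edit or two per word, while 30% uniform noise pushes most words much further.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

clean = "What automatic redelivery behavior should the operator expect from GitHub?"
typo  = "What automatic redelivrey behavior should the opeartor expect from GitHub?"
noise = "Whzt auzomrfic redelifsry brhxvior spuld the onhratxr exphxt from GitXub?"

for label, variant in [("typo", typo), ("noise", noise)]:
    per_word = [edit_distance(c, v) for c, v in zip(clean.split(), variant.split())]
    print(label, per_word)
```

In the typo version only two words drift at all, which leaves most subword units intact; in the noise version nearly every word drifts, which is the regime where the cliff appears.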

This explanation is consistent with both the benchmark and the cited work; the evidence supports this level of causal interpretation and no more.

Practical takeaway


If you use an LLM with the relevant document, log, or code snippet in front of it, prompt clarity deserves more attention than spelling mistakes.

Focus on three things:

  • provide the right context
  • state the task clearly
  • specify what counts as a correct output
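The checklist above fits a small prompt-assembly sketch (hypothetical function and section names, shown only to make the three points concrete; the context sentence is illustrative, not quoted from GitHub's docs):

```python
def build_prompt(context: str, task: str, output_spec: str) -> str:
    """Assemble a request that supplies context, states the task, and pins the output shape."""
    return (
        "Context:\n"
        f"{context.strip()}\n\n"
        "Task:\n"
        f"{task.strip()}\n\n"
        "Answer format:\n"
        f"{output_spec.strip()}\n"
    )

prompt = build_prompt(
    context="If a webhook delivery fails, GitHub does not retry it automatically.",
    task="What automatic redelivery behavior should the operator expect?",
    output_spec="One sentence stating whether automatic redelivery occurs.",
)
print(prompt)
```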

Ordinary typos often had little measurable effect in this setup. Vague requests reduced QA accuracy much more. Random character corruption remained a real failure mode.

References