Typos Barely Hurt LLMs. Clear Prompts Matter More.

I ran a test on noisy prompts in two common workflows: answering questions about a provided document and code editing.

The main result is simple: ordinary typos had little measurable effect, clear prompts improved accuracy, and having a clean reference document (a relevant text such as documentation or a spec supplied directly in the prompt) pushed the failure point from 30% to 50% noise. A more capable model also raised the accuracy floor across all conditions.

Clean:  What automatic redelivery behavior should the operator expect from GitHub?
Typo:   What automatic redelivrey behavior should the opeartor expect from GitHub?
Noise:  Whzt auzomrfic redelifsry brhxvior spuld the onhratxr exphxt from GitXub?

Figure: Without a reference document, all three models collapse at 30% noise.
Figure: With a clean document supplied, the failure point shifts: performance holds through 30% and only collapses at 50%.

Main result

  • Prompt clarity mattered more than spelling mistakes in this test.
  • Having a reference document pushed the failure point from 30% to 50% noise.
  • Random character corruption caused a sharp drop around 30% noise.

What I tested

I ran 45 question-answering tasks (each with a reference document) and 24 code-editing tasks across five noise levels. In the primary test, the model always received a clean reference document; only the request varied. Full agents, codebase navigation, and open-ended chat are out of scope.

Version of the request                     Typical outcome
Clean request + clean context              Correct
Human typo version + clean context         Usually still correct
30% random-noise version + no context      Often wrong
30% random-noise version + clean context   Much better than the no-context version

I applied two types of errors: natural typos (swaps, drops, insertions, keyboard-neighbor mistakes) and random scrambling (each character replaced randomly at a fixed rate). Each task was scored automatically. These results apply to this setup only.
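
The two corruption types can be sketched as follows. This is an illustration, not the exact generator used in the test: the function names, the partial keyboard-neighbor map, and the choice to corrupt only letters (leaving spaces and punctuation intact) are assumptions.

```python
import random
import string

# Hypothetical, partial QWERTY neighbor map (an assumption for this sketch).
KEYBOARD_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "r": "edft", "t": "rfgy", "n": "bhjm", "s": "awedxz",
}


def uniform_noise(text, rate, rng):
    """Replace each letter with a different random letter with probability `rate`."""
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            # Pick any lowercase letter other than the original one.
            options = [c for c in string.ascii_lowercase if c != ch.lower()]
            new = rng.choice(options)
            out.append(new.upper() if ch.isupper() else new)
        else:
            out.append(ch)
    return "".join(out)


def add_typos(text, n_edits, rng):
    """Apply `n_edits` human-style edits: swap, drop, insert, or neighbor key."""
    chars = list(text)
    for _ in range(n_edits):
        # Only edit at letter positions so spaces and punctuation survive.
        positions = [i for i, c in enumerate(chars) if c.isalpha()]
        if not positions:
            break
        i = rng.choice(positions)
        kind = rng.choice(["swap", "drop", "insert", "neighbor"])
        if kind == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif kind == "drop":
            del chars[i]
        elif kind == "insert":
            chars.insert(i, rng.choice(string.ascii_lowercase))
        else:
            # Neighbor-key substitution; also the fallback when a swap is not possible.
            neighbors = KEYBOARD_NEIGHBORS.get(chars[i].lower(), string.ascii_lowercase)
            chars[i] = rng.choice(neighbors)
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    request = "What automatic redelivery behavior should the operator expect from GitHub?"
    print(add_typos(request, 2, rng))
    print(uniform_noise(request, 0.30, rng))
```

Passing in a seeded `random.Random` keeps every corrupted variant reproducible, which matters when the same request is scored at several noise levels.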

Ordinary typos had little measurable effect. On QA, typo corruption reached 32 edits on a roughly 100-word request. Accuracy fell from 84% to 69% at the highest typo level (p = 0.078, not significant at 0.05). On code, GPT-5.4-mini stayed at 96% across all typo levels; GPT-5.4-nano drifted to 75% at the highest level, also not statistically significant.

Small spelling errors leave enough of each word intact for the model to understand the question (Sennrich, Haddow, and Birch, 2016). Deliberately crafted misspellings cause larger drops; that is a different problem (Pruthi, Dhingra, and Lipton, 2019). These results are expected and consistent with current literature on neural model robustness.

Random noise caused a sharp drop. Without a context document, accuracy declined at each noise level: 64% at 15% noise, 20% at 30%, and just 4% at 50%. Code editing followed the same pattern: GPT-5.4-mini scored 96% at 5% noise, then 92%, 63%, and 25% as noise reached 50%. Stronger models had a higher floor at extreme noise; the same pattern held across both task types (Belinkov and Bisk, 2018).

Having a reference document pushed the failure point from 30% to 50% noise. With a reference document, the model scored 69% at 30% noise. Without a reference document, the same questions scored 20%. A clean spec, docs excerpt, or code block gave the model a reliable reference to work from (Karpukhin et al., 2020).

At 50% noise, supplementary context determined what the model could still get right. "Structure" here means any additional information provided alongside the noisy prompt: answer options, a code block, or a reference text. Even when the prompt itself was largely unreadable, these gave the model reliable anchors to work from. The following four conditions were compared at this noise level:

Structure available        GPT-5.4-nano   GPT-5.4-mini   GPT-5.2
None (no context)          0%             4%             4%
Multiple-choice answers    8%             8%             58%
Code context               29%            25%            42%
Text context (QA)          44%            42%            58%
At 50% noise, structure determines the floor: the type of structure available and the model tier jointly determine accuracy.

Prompt clarity had a larger effect than typos. I wrote verbose and concise versions of the same 45 QA tasks. Verbose prompts named the full scenario and specified exactly what to answer; concise prompts asked the same thing in a shorter form:

Concise: What happens after a failed webhook delivery?
Verbose: If a GitHub webhook delivery fails and nobody clicks Redeliver, what
         automatic retry behavior should the operator expect from GitHub?

On clean prompts, verbose scored 91.1% against 75.6% for concise, a gap of 15.5 percentage points (p = 0.039) that narrowed to 11.1 percentage points under light noise.

Condition             Verbose   Concise   Gap
Clean                 91.1%     75.6%     15.5 pp
Light uniform noise   82.2%     71.1%     11.1 pp
Verbose prompts outperformed concise prompts by 15.5 percentage points (pp) on clean prompts and 11.1 pp under light noise. The clean-prompt gap is significant at p = 0.039.
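
One way to test a paired comparison like this is an exact McNemar test on the tasks where the two prompt styles disagree. The sketch below is illustrative only: `mcnemar_exact` is a hypothetical helper, the counts passed to it are made up, and it is not necessarily the test behind the p-values above.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: tasks answered correctly only under condition A (e.g. verbose)
    c: tasks answered correctly only under condition B (e.g. concise)
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, each discordant pair favors A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts for illustration; not taken from this test.
    print(round(mcnemar_exact(9, 2), 4))  # 0.0654
```

Tasks that both prompt styles get right, or both get wrong, do not enter the calculation; only the disagreements carry information about which style is better.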

Practical takeaway

If you use an LLM with a document, log, or code snippet as context, prompt clarity deserves more attention than spelling mistakes. This is consistent with current literature and documentation on language model robustness.

Focus on three things:

  • provide the right context
  • state the task clearly
  • specify what counts as a correct output
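
A minimal sketch of those three items as a prompt template. The function name `build_prompt`, the section labels, and the example strings are illustrative assumptions, not a required format.

```python
def build_prompt(context: str, task: str, success_criteria: str) -> str:
    """Assemble a prompt from context, task, and a definition of a correct output."""
    sections = [
        ("Context", context),
        ("Task", task),
        ("A correct answer", success_criteria),
    ]
    # Each section gets a label line followed by its body; blank lines separate sections.
    return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)


if __name__ == "__main__":
    prompt = build_prompt(
        context="<paste the relevant documentation excerpt here>",
        task=(
            "If a GitHub webhook delivery fails and nobody clicks Redeliver, what "
            "automatic retry behavior should the operator expect from GitHub?"
        ),
        # Hypothetical success criterion, written for this example.
        success_criteria="Answer in one or two sentences, using only the excerpt above.",
    )
    print(prompt)
```

Keeping the context in its own labeled section means a typo in the task line leaves the reference text untouched.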

References