Context Auto-Escalation: Stop Burning Tokens

ai-engineering · performance · architecture

There's a common mistake when working with AI: give it maximum context upfront.

It feels logical. In practice, it's expensive and mostly pointless.

For the majority of tasks, the model needs very little: a list of files, short snippets, a minimal description. But sometimes that genuinely isn't enough.

That's where a technique I've started using often comes in: auto-escalation.

How it looks in practice

I was building a plugin that generates commit messages. I didn't want to feed the model the entire repository every time.

So the baseline step is simple:

  • The LLM gets a list of changed files
  • Short snippets of the diff
  • Minimal context

And it returns a result with a confidence score.
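The baseline step above can be sketched roughly like this. This is a minimal illustration, not the plugin's actual code: `MAX_SNIPPET` and the diff format are assumptions I'm making for the example.

```python
# Minimal-context builder: the cheap first pass only sees file names
# and short, truncated diff snippets.

MAX_SNIPPET = 200  # characters of diff kept per file in the cheap pass

def build_minimal_context(diffs: dict[str, str]) -> str:
    """Compose the cheapest useful prompt: changed files plus short snippets."""
    lines = ["Changed files:"]
    lines += [f"- {path}" for path in diffs]
    lines.append("")
    lines.append("Snippets:")
    for path, diff in diffs.items():
        lines.append(f"--- {path} ---")
        lines.append(diff[:MAX_SNIPPET])  # truncate aggressively; escalation can add more later
    return "\n".join(lines)

ctx = build_minimal_context({
    "auth/login.py": "+ def login(user):\n+     return token_for(user)",
    "README.md": "+ Added login docs",
})
```

The point of the truncation is that this context is almost free in tokens, and for most commits it's already enough.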

If the confidence is below a threshold (say, 0.6–0.7), auto-escalation kicks in. The model gets more context: longer code chunks, full diffs, additional details about the surrounding code.

If the confidence is high, nothing happens. The cheap answer was good enough.
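The escalation logic itself is a few lines. A minimal sketch, assuming a `generate` function that returns an answer with a self-reported confidence (here stubbed out; in reality you'd parse the confidence from structured model output):

```python
# Auto-escalation: try the cheap pass first, re-run with richer context
# only when confidence falls below the threshold.

CONFIDENCE_THRESHOLD = 0.65  # the 0.6-0.7 range mentioned above

def generate(context: str) -> tuple[str, float]:
    """Stand-in for a real LLM call returning (answer, confidence)."""
    rich = "full diff" in context  # toy heuristic for the sketch
    return ("fix: correct token refresh in auth flow", 0.9 if rich else 0.4)

def commit_message(minimal_ctx: str, full_ctx: str) -> str:
    answer, confidence = generate(minimal_ctx)   # cheap first pass
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer                            # good enough, stop here
    answer, _ = generate(full_ctx)               # escalate: full diffs, more code
    return answer

msg = commit_message("changed files only", "full diff of auth/login.py")
```

Note that the expensive call only ever happens after the cheap one has failed to clear the bar, so the worst case costs slightly more than always sending full context, while the common case costs far less.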

What this actually changed

  • Most requests stay cheap — minimal tokens, fast response
  • Complex cases automatically get more attention
  • Accuracy doesn't drop, because the system knows when it needs help
  • Tokens stop burning for no reason

The key insight is simple: context should be dynamic, not "one size fits all."

Start with the minimum. Escalate only when it's genuinely needed. Most of the time, you'll be surprised how little the model actually requires to do a good job.