Expensive tokens for judgment, cheap tokens for labour
This phrase has become a bit of a shorthand in my setup:
Expensive tokens for judgment, cheap tokens for labour.
I’m not putting it forward as some grand theory of AI systems. It’s just a practical rule I kept arriving at after enough trial and error.
A lot of the work inside an agent setup doesn’t actually need the best available model.
That sounds obvious when stated plainly, but it took me a while to really build around it.
Where it came from
For a while, I kept treating model choice as if it were mainly about quality in the abstract.
Use the smartest model you can afford. If the task matters, spend more. If you want better output, move up the stack.
There is some truth in that, but only up to a point.
The thing that eventually became hard to ignore was that not all work inside the system was the same kind of work.
Some parts needed judgment. Some parts just needed labour.
Once I started separating those two, the routing got simpler.
What I mean by judgment
Judgment is the work where the model has to decide what matters.
That might be:
- planning what to do next
- deciding which path or tool makes sense
- reviewing whether an answer is actually good enough
- spotting risk
- weighing tradeoffs
- turning a messy pile of material into a final opinion
This is where I’m happy to spend more.
If the job is to make a call, or to review something carefully, or to decide whether something is safe enough to proceed, that is not where I want to be cheap.
What I mean by labour
Labour is different.
This is the volume work. The repetitive work. The first-pass work.
Things like:
- scanning a batch of sources
- extracting the obvious parts
- rough summarising
- reformatting
- converting one structure into another
- pulling together material for a later review pass
This work still matters. It just usually doesn’t need the smartest model in the room.
And if you use your most expensive model for all of it, you end up paying premium rates for shovel work.
That may be fine at small scale. It gets old pretty quickly once the system starts doing more.
This changed how I think about routing
Once I started thinking in those terms, model routing stopped feeling like a question of taste and started feeling more like resource allocation.
That’s a much more useful frame.
I’m not really trying to answer, “Which model is best?”
I’m usually trying to answer something more ordinary:
Which model is good enough for this piece of work, given what the work actually is?
In my setup, that generally means:
- stronger frontier models for orchestration, review, and the parts where judgment really matters
- local models for bulk processing, recurring scans, and other first-pass labour
- Claude Code in Docker for implementation and review where code quality is more important than thrift
The names will change over time. The principle still holds.
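The split above can be sketched as a tiny routing table. This is a sketch, not my actual code: the model names are placeholders, and `WorkKind` is a label I'm inventing here to make the judgment/labour distinction concrete.

```python
from enum import Enum

class WorkKind(Enum):
    JUDGMENT = "judgment"  # planning, review, risk, final calls
    LABOUR = "labour"      # scanning, extraction, reformatting

# Placeholder model names; substitute whatever tiers you actually run.
MODEL_FOR = {
    WorkKind.JUDGMENT: "frontier-model",
    WorkKind.LABOUR: "local-small-model",
}

def route(kind: WorkKind) -> str:
    """Pick a model by the kind of work, not by abstract quality."""
    return MODEL_FOR[kind]
```

The point of the sketch is that the routing decision keys on the kind of work, so adding a new task means classifying it, not debating models again.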
A research example
The research pipelines made this pretty clear.
If I’m scanning a bunch of sources, I do not need a frontier model to lovingly read every item from top to bottom and produce a polished take on each one.
What I need first is something that can do the grunt work:
- fetch the material
- extract the relevant parts
- strip out the junk
- produce rough summaries
- surface the items that look worth a closer look
That is labour.
Then, once that pile has been reduced, a stronger model can step in and do the part that actually benefits from better judgment:
- what really matters here?
- what is signal and what is noise?
- what belongs in the final briefing?
- what is just repetition dressed up as novelty?
That split saves money, but it also helps with focus. The stronger model is spending its effort where it is most useful, rather than chewing through the first-pass workload.
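The two-pass shape above can be sketched in a few lines. Everything here is assumed for illustration: `call_model` is a stub standing in for a real API or local-model call, and both model names are placeholders.

```python
def call_model(model: str, prompt: str) -> str:
    # Stub so the sketch runs; a real system would invoke a model here.
    return f"[{model}] {prompt[:40]}"

def first_pass(sources: list[str]) -> list[str]:
    """Labour: rough summaries from a cheap local model, one per source."""
    return [call_model("local-small-model", f"Summarise: {s}") for s in sources]

def judgment_pass(summaries: list[str]) -> str:
    """Judgment: a single stronger-model call over the reduced pile."""
    joined = "\n".join(summaries)
    return call_model("frontier-model", f"What actually matters here?\n{joined}")

briefing = judgment_pass(first_pass(["source A", "source B", "source C"]))
```

The design point is that the expensive call happens once, over material the cheap passes have already reduced, instead of once per source.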
The same thing applies to coding
Coding has its own version of this.
There are parts of software work where I want stronger models involved: architecture choices, careful review, security-sensitive checks, or anything with enough edge cases that a weak answer becomes expensive later.
But there is also plenty of coding work that is mostly implementation inside a clear frame.
That doesn’t mean I think the cheapest model should do all of it. I don’t.
It just means I stopped pretending every part of the process deserved the same level of model.
Sometimes paying more is the right call. Sometimes it’s wasteful. The point is to tell the difference.
Why I don’t like “best model everywhere” as a default
I understand the appeal of just using the strongest model for everything.
It’s simple. It saves you from making routing decisions. It also feels like the safe option, because at least you know you’re not under-spending.
But it creates its own problems.
It’s slower. It’s more expensive. It burns limited capacity on work that often doesn’t justify it. And it encourages lazy system design, because you stop asking what kind of work is actually being done.
That last part matters.
If every routing question gets answered with “just use the best model,” then you never really design the system. You just throw horsepower at it.
Sometimes that works. It also gets expensive and untidy very quickly.
Cheap labour is only useful if it is good enough
There is an obvious caveat here.
Cheap models are not free just because they are cheap.
If a weak model does the labour badly, the cost comes back somewhere else. You pay for it in cleanup, false positives, missed signals, or review passes that have to rescue the output.
So there is still a minimum standard.
The labour model has to be good enough that the stronger model is refining decent material rather than trying to rescue rubbish.
That’s the line I care about.
I’m not trying to get cost to zero. I’m trying to avoid paying judgment prices for work that doesn’t need judgment.
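One way to enforce that line is an escalation floor: if the labour output fails a quality check, redo that item with the stronger model, so you pay judgment prices only on the failures. This is a sketch under invented names: `quality`, `cheap_summarise`, and `strong_summarise` are all stand-ins, and the trivial length check would be replaced by whatever "good enough" means in your pipeline.

```python
def quality(summary: str) -> float:
    """Stand-in check; a real one might validate structure or use a grader."""
    return min(len(summary) / 50.0, 1.0)

def cheap_summarise(item: str) -> str:
    return f"rough notes on {item}"      # stand-in for a local-model call

def strong_summarise(item: str) -> str:
    return f"careful summary of {item}"  # stand-in for a frontier-model call

def labour_with_floor(item: str, threshold: float = 0.5) -> str:
    draft = cheap_summarise(item)
    if quality(draft) >= threshold:
        return draft                 # good enough: pass it upstream
    return strong_summarise(item)    # below the floor: escalate this item
```

If most items clear the floor, the stronger model is refining decent material; if most items escalate, the cheap model isn't earning its place and the split should be rethought.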
Where I ended up
This principle made the whole system feel calmer.
Not smarter in some abstract sense. Just calmer. Less wasteful. More deliberate.
The stronger models were being used where I actually wanted their judgment. The cheaper or local models were doing the repetitive work they were good enough to do. The setup made more sense to me.
It also made it easier to explain to myself.
Not “this model is my favourite.” Not “this model topped some benchmark.” Just: this model is doing this job because this job needs this kind of thinking.
That is a much better way to build.
The short version
If I had to reduce it to one line, it would be this:
Good agent systems are not only model systems. They are allocation systems.
Or, more plainly:
Spend heavily where judgment matters. Save money where the work is mostly labour.
That’s all this phrase is really trying to say.