Where I run cheap models and where I don't
Hermes Agent is the bit doing the coding-agent work in my setup. I keep it in a container. A frontier model sits up top for the parts where I want actual judgment. A local model on the inference node handles bulk work. The routing is simple: each step calls whichever tier I think fits it.
I didn’t sit down and design the split all at once. It drifted into shape after I got tired of paying frontier prices for what was basically shovel work. The shorthand I use for it now, mostly to myself, is “expensive tokens for the judgment, cheap tokens for the labour.”
How it drifted that way
Early on I treated model choice as a quality question. Pick the best one you can afford and use it everywhere. If the work matters, spend more.
That works up to a point. Then the bills nudge you into looking at the work more honestly.
A lot of what the system was doing didn’t really need the smart model. It needed something that could read a page without falling over, pull the obvious bits out, and hand the result up the chain. The frontier model was being wheeled in to do warehouse work.
Once I started separating “decide” from “process”, the routing got simpler.
What I count as judgment
In my setup, judgment is the work where the model has to decide what matters. Planning the next step. Choosing between two approaches. Reviewing whether an answer is good enough. Spotting a risk in a diff. Weighing a tradeoff where the cheap answer isn’t obviously right.
This is where I’m happy to spend. If the job is to make a call or to check something carefully, I don’t want to be cheap about it.
What I count as labour
Labour is the volume work. Scanning a batch of articles. Pulling the obvious parts out. Rough summarising. Converting one shape into another. Stacking material together so a stronger model can look at it later.
It still matters — a bad labour pass poisons everything downstream — but it doesn’t usually need the best model in the room. If I send all of it to the frontier model, I’m paying premium rates to reformat text. Fine at small scale. Less fine once the system is doing more than one thing at a time.
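To make the decide/process split concrete, here's a minimal sketch of routing by task kind. Every name in it is illustrative, not lifted from my setup; the one real design choice is the default, which sends anything unclassified up to the expensive tier, because mis-classifying judgment as labour is the costly direction.

```python
from enum import Enum

class Tier(Enum):
    FRONTIER = "frontier"  # expensive tokens: judgment
    LOCAL = "local"        # cheap tokens: labour

# Hypothetical step registry: each pipeline step declares its kind up
# front, so routing is a lookup rather than a per-call decision.
STEP_TIERS = {
    "plan_next_step": Tier.FRONTIER,   # deciding what matters
    "review_answer":  Tier.FRONTIER,   # checking carefully
    "scan_articles":  Tier.LOCAL,      # volume reading
    "extract_fields": Tier.LOCAL,      # pulling the obvious bits out
    "rough_summary":  Tier.LOCAL,      # first-pass summarising
}

def route(step_name: str) -> Tier:
    # Unknown steps default to the expensive tier: mis-classifying
    # judgment as labour costs more than the reverse.
    return STEP_TIERS.get(step_name, Tier.FRONTIER)
```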
The research pipeline
Research is where this split paid off first.
When I’m scanning thirty or forty sources, I don’t want a frontier model reading each one top to bottom. I want something that can fetch the material, pull out the relevant parts, skip the junk, and produce rough summaries, flagging anything that looks worth a second pass. That’s labour, and the local model handles it well enough.
Once the pile is smaller, the stronger model takes the part that benefits from actual judgment. What’s signal here, what’s noise. What belongs in the briefing. What’s just the same story retold with a different lede.
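As a sketch of those two stages, assuming a call_model(tier, prompt) helper that wraps whatever clients actually serve each tier (the helper and the prompts are placeholders, not lifted from my pipeline):

```python
def call_model(tier: str, prompt: str) -> str:
    """Placeholder: route the prompt to whatever client serves that tier."""
    raise NotImplementedError

def research_pass(sources: list[str]) -> str:
    # Stage 1: labour. The local model reads every source and produces
    # a rough summary plus a flag on anything worth a closer read.
    summaries = [
        call_model(
            "local",
            "Summarise the relevant parts and flag anything worth "
            "a closer read:\n\n" + text,
        )
        for text in sources
    ]

    # Stage 2: judgment. The frontier model only sees the smaller pile,
    # so its tokens go on deciding what's signal and what's noise.
    return call_model(
        "frontier",
        "From these rough summaries, decide what is signal, what is "
        "noise, and what belongs in the briefing:\n\n"
        + "\n---\n".join(summaries),
    )
```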
The cost saving is real, but the thing I actually notice is focus. The stronger model spends its tokens on the decisions, not on the warehouse work leading up to them.
Coding has the same shape
Coding has its own version of this. Some parts I want a strong model on: architecture choices, careful review, security-sensitive checks, anything with enough edge cases that a weak answer costs more to clean up than it saved.
Plenty of the rest is implementation inside a clear frame. A function with a well-defined signature. A refactor with a test suite around it. Boilerplate that has to be correct but isn’t load-bearing in a thinking sense. Those clear-frame jobs go to Hermes Agent in the container.
I’m not claiming the cheapest model should do all of that. I just stopped pretending every part of the process deserved the same tier.
In my own routine, the bit that makes me comfortable letting a mid-tier agent do the implementation is the review pass: a stronger model reads every diff before it lands. Without that pass I’d push more of the work back up the chain. With it, the cost profile shifts and I haven’t had to clean up much.
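A rough sketch of that gate, with both helpers as placeholders (implement() standing in for the containerised agent, review_diff() for the stronger model’s read of the diff):

```python
def implement(task: str) -> str:
    """Placeholder: the mid-tier agent in the container produces a diff."""
    raise NotImplementedError

def review_diff(diff: str) -> bool:
    """Placeholder: the stronger model reads the diff and approves or not."""
    raise NotImplementedError

def guarded_implementation(task: str, max_attempts: int = 2) -> str | None:
    # Cheap tokens do the typing; expensive tokens gate what lands.
    for _ in range(max_attempts):
        diff = implement(task)
        if review_diff(diff):
            return diff
    # If the cheap pass keeps failing review, the task was probably
    # mis-classified as labour; escalate it a tier instead of retrying.
    return None
```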
Where it shows up day to day
In my setup the split lands as:
- stronger models for orchestration, review, and decisions
- the local node for bulk scans, extractions, and recurring first-pass work
- a coding agent in a container for implementation, with a stronger model on review duty
The specific models behind those roles rotate as new ones come out. The shape of the routing hasn’t changed much for a while.
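For what it’s worth, the stable part is roughly a role-to-model map: the keys are the roles that stay put, and the values here are invented placeholders for whatever currently fills them.

```python
# The shape that stays put: roles on the left, whatever currently
# fills them on the right. The model names here are made up.
ROLES = {
    "orchestration":  "frontier-model",
    "review":         "frontier-model",
    "bulk_scan":      "local-node-model",
    "extraction":     "local-node-model",
    "implementation": "containerised-coding-agent",
}
```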
Where it breaks
Cheap models aren’t free just because they’re cheap. If the labour tier does a bad job, the cost reappears downstream as cleanup, missed signals, or a review pass that’s mostly rescue work.
So there’s a floor. The labour model has to be good enough that the stronger model is refining decent material, not salvaging rubbish. When I’ve pushed that floor too low — usually by trying a smaller local model out of curiosity — the whole pipeline gets worse, and I notice it as more frustration rather than more savings.
The other place it breaks is when I mis-classify the task. Something I thought was labour turns out to need judgment, and the cheap pass flattens a distinction that mattered. That one’s on me, not the model. The fix is usually just to move that step up a tier and stop pretending.
Where it sits now
The routing has been stable for a while. Stronger models on the steps where I actually want them thinking. The local node on the work it’s good enough for. When I add a new step, the question I ask is which tier the work belongs in, and most of the time the answer is obvious from the shape of the step.
That’s about where I’ve left it.