JIT
← Notebook
Idea notes AI AgentsArchitectureCloud

Own the baseline, rent the bleeding edge

Token prices are heading for a real-cost reckoning. The way through is the one we use for compute: own the AI baseline, burst for the bleeding edge.

Right now, frontier-model tokens are cheap in a way that won’t last. The big labs are pricing for a land-grab, not for the power bill — and at some point the price has to walk back toward what the inference actually costs. When it does, the teams who wired their whole workflow to metered frontier tokens are the ones who’ll feel it first: either the invoice bites, or the rate limit does. I’ve seen this movie before. It was called cloud compute.

We already solved this once

Nobody runs their steady-state compute on on-demand cloud pricing and calls it a strategy. You put the baseline on hardware you own — or colo — where the cost is predictable, and you burst into the cloud for the spiky, the novel, the genuinely elastic. Capex for the floor, opex for the peak. AI is about to grow the same shape, for the same reason: the economics force it.

Two tiers of model

Not every token is worth frontier money. The bleeding-edge work — hard reasoning, a novel problem, the thing you’re figuring out for the first time — pay the premium there, it earns its keep. But most of an agent’s life — and I run a whole fleet of them — is bounded and repetitive: classify this, extract that, summarise the ticket, run the same loop a thousand times. That’s baseline load. You don’t want it metered at top-tier prices, and it doesn’t need to be — an open-weight model on GPUs you own, served with something like Ollama or vLLM, does that work fine, and the marginal token is effectively free.

The router is the new load balancer

What makes it a stack and not two silos is the layer in between. A gateway — OpenRouter, or a self-hosted LiteLLM — that routes each call by policy, not by hand. Cheap and bounded → the local model. Hard and novel → the frontier API. Sensitive data → never leaves owned infrastructure, full stop. You decide it per prompt, or per agent run, with a cost ceiling attached. The payoff: the cost curve of owned hardware, the burst capacity of the cloud, and — the part that pays my mortgage — an audit trail of which model saw which data.

Prompt a single request Agent run a loop, many calls Token router — the gateway OpenRouter · self-hosted LiteLLM — route by policy, not by hand cost ceiling task class data sensitivity bounded · repetitive or sensitive data hard · novel first-time work Local / owned models open-weight · Ollama · vLLM on GPUs you own — the baseline marginal token ≈ free Frontier API premium · bleeding edge rented — pay per token reserve for the hard stuff
A prompt or an agent run hits the router; policy picks the tier. Bounded, repetitive, or sensitive work stays on owned hardware; the bleeding edge rents the frontier.

Treat tokens like compute: own the baseline, rent the bleeding edge.

None of this is humming in my rack today — local-LLM serving is still more fiddle than an API key, and the frontier moves fast enough that renting it is the honest call. But the price signal already points this way, and the pattern is sitting on the shelf, fully specced, for the day the invoice makes the case for me.