
Small Models Are Eating the Enterprise: Why CTOs Are Quietly Downsizing Their AI

Frontier models grab the headlines, but a counter-trend is reshaping enterprise AI budgets: small, fine-tuned models running cheap, fast, and on-premise. The economics are hard to argue with.

By The Daily Query · 3 min read

There's a version of the AI story where every company routes everything through the biggest frontier model available, forever. That's the version the headlines tell. The version showing up in enterprise budgets looks different: a steady migration of production workloads from giant general-purpose models to small, specialized ones — and CFOs are the ones driving it.

The 90/10 realization

Here's the pattern, repeated across dozens of engineering blogs and earnings calls this year: a company launches an AI feature on a frontier model because it's the fastest path to "does this even work?" The answer is yes. Usage grows. The invoice grows faster.

Then someone runs the analysis and finds that 90 percent of their traffic is repetitive, well-scoped, and nothing like the open-ended reasoning the frontier model was built for. Classifying support tickets. Extracting fields from invoices. Summarizing meeting notes in a house style. Tasks with clear inputs, clear outputs, and mountains of historical examples.

For that 90 percent, a small model fine-tuned on the company's own data routinely matches the big model's quality — at one-tenth to one-fiftieth the cost per call, with latency low enough to feel instant.

The frontier model doesn't disappear. It gets reassigned to the 10 percent that actually needs it, plus a new job: generating training data to make the small models better. The big model becomes the teacher, not the workhorse.
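The arithmetic behind that split can be checked in a few lines. The sketch below is illustrative only: it assumes exactly 90 percent of calls move to the small model, that every call costs the same within a tier, and that the frontier model's cost per call is normalized to 1.0. The function name `blended_cost` is made up for this example.

```python
def blended_cost(small_share: float, small_cost_ratio: float) -> float:
    """Average cost per call, relative to sending every call to the frontier model (1.0)."""
    frontier_share = 1.0 - small_share
    return frontier_share * 1.0 + small_share * small_cost_ratio


# The two ends of the "one-tenth to one-fiftieth" range quoted above.
for ratio in (1 / 10, 1 / 50):
    cost = blended_cost(small_share=0.90, small_cost_ratio=ratio)
    print(f"small model at {ratio:.2f}x cost -> blended {cost:.3f}x, saving {1 - cost:.1%}")
# small model at 0.10x cost -> blended 0.190x, saving 81.0%
# small model at 0.02x cost -> blended 0.118x, saving 88.2%
```

Under these assumptions the frontier model's remaining 10 percent dominates the bill, so savings top out near 90 percent no matter how cheap the small model gets. Pushing past that means shrinking the frontier share itself.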

Why now

Three forces converged to make this practical rather than aspirational:

Open-weight small models got genuinely good. The gap between a well-tuned small model and a frontier model on narrow tasks has collapsed. On broad reasoning it remains real; on "classify this document into one of twelve categories," it's gone.

Fine-tuning stopped being a research project. What used to require an ML team and a GPU cluster is now a managed service with an API. Upload examples, wait, deploy. The expertise barrier dropped from "PhD" to "diligent engineer with an afternoon."

On-premise became a feature, not a compromise. For banks, hospitals, and defense contractors, a small model running inside their own walls settles the data governance question in a way cloud APIs never quite could. Compliance teams have become unexpected champions of the small-model movement.

The new enterprise stack

The architecture emerging from all this is a router, not a monolith:

  • Tier 1 — small fine-tuned models handle the high-volume, well-defined work. Cheap, fast, often self-hosted.
  • Tier 2 — mid-size general models catch requests the small models flag as out of scope.
  • Tier 3 — frontier models handle genuinely hard reasoning, plus offline jobs: evaluating outputs, generating synthetic training data, and proposing improvements to the tiers below.
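The online half of this tiering can be sketched as a simple escalation chain. Everything here is hypothetical: the handler functions are stubs standing in for real model calls, the out-of-scope rules are invented for illustration, and escalating from Tier 2 to Tier 3 when a handler declines is an assumption, not something any particular team is reported to do. Tier 3's offline jobs are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A handler returns an answer, or None to signal "out of scope, escalate".
Handler = Callable[[str], Optional[str]]


@dataclass
class Tier:
    name: str
    handle: Handler


def route(request: str, tiers: list[Tier]) -> tuple[str, str]:
    """Try each tier in order; return (tier name, answer) from the first that accepts."""
    for tier in tiers:
        answer = tier.handle(request)
        if answer is not None:
            return tier.name, answer
    raise RuntimeError("no tier accepted the request")


# Stub handlers (hypothetical) standing in for real model calls.
def small_classifier(request: str) -> Optional[str]:
    # Pretend the fine-tuned model only handles support-ticket classification.
    if request.startswith("ticket:"):
        return "category=billing"
    return None  # flag as out of scope


def mid_general(request: str) -> Optional[str]:
    # Pretend the mid-size model declines long requests.
    if len(request) < 80:
        return "mid-size answer"
    return None


def frontier(request: str) -> Optional[str]:
    # Last resort: always answers.
    return "frontier answer"


TIERS = [
    Tier("tier1-small", small_classifier),
    Tier("tier2-mid", mid_general),
    Tier("tier3-frontier", frontier),
]

print(route("ticket: I was charged twice", TIERS))  # ('tier1-small', 'category=billing')
print(route("What is our refund policy?", TIERS))   # ('tier2-mid', 'mid-size answer')
print(route("x" * 200, TIERS))                      # ('tier3-frontier', 'frontier answer')
```

Having each tier return `None` to decline keeps the scope decision inside the tier itself, which matches the idea of small models flagging what they cannot handle. A production router would more likely use confidence scores or a dedicated classifier.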

Teams running this pattern report cost reductions of 70 to 95 percent on inference, which is the kind of number that turns an experiment into a mandate.

What it means for the model providers

This isn't bad news for frontier labs — it's a repositioning. Their models are becoming the R&D layer of the stack: the thing you use to figure out what's possible and to manufacture the data that makes cheap deployment possible. High-margin, lower-volume.

The squeeze lands hardest on the middle: models too big to be cheap, too small to be the smartest. The market is bifurcating into "best" and "good enough, but free-ish," and the space between is getting thin.

The takeaway

The most consequential AI trend of the year might not be a capability. It might be a routing decision. Enterprises figured out that intelligence, like compute before it, is something you provision in the right size — and most jobs are smaller than the hype suggested.

If you're building on AI today, steal the playbook: prototype on the best model you can get, measure what your traffic actually looks like, then ruthlessly downsize everything that doesn't need brilliance. Your invoice — and your latency chart — will thank you.
