Can tech companies learn to love cheaper AI models?

Overview

Tech companies are starting to route most everyday AI work to cheaper models instead of defaulting to the most expensive frontier systems, a change that would reshape the economics of the field. The clearest example comes from legal AI firm Harvey, which cut inference costs by about 3x with no drop in quality by pairing an open-source model with Anthropic's Claude Opus and sending only the hardest tasks to the pricier system. Coinbase co-founder Brian Armstrong predicts 80 percent of AI workloads will run on far cheaper models within 12 to 18 months.

Key Takeaways

For years the industry assumed bigger always meant better, so companies reached for the priciest frontier models by default. Rising bills are forcing a rethink.
Harvey paired an open-source worker model with Claude Opus as an occasional advisor and cut inference costs by roughly 3x while matching or beating the quality of the standalone flagship.
Brian Armstrong predicts 80 percent of AI workloads will run on models that cost 99 percent less within 12 to 18 months, leaving only 20 percent on top-tier systems.
The shift threatens the revenue assumptions behind OpenAI and Anthropic as both head toward public listings.
The new measure of quality is not whether a team uses the strongest model for everything, but whether it uses the cheapest model that still reaches the right answer.
Buyers now weigh tiers within a vendor's lineup, such as GPT-5.5 against GPT-5.4-mini, or open-weight options like DeepSeek V4 Flash, rather than always choosing the flagship.

Stats & Key Facts

›80 percent of AI workloads predicted to run on 99 percent cheaper models within 12 to 18 months
›Harvey reported a 3x reduction in inference costs with no quality loss
›Hybrid setup passed 18 of 100 benchmark tasks at a total cost of 368 dollars
›Standalone Claude Opus 4.7 passed 14 of 100 tasks at a cost of 954 dollars
›The hybrid approach delivered higher scores at roughly 39 percent of the cost, a saving near 61 percent
›The frontier advisor was invoked an average of 0.83 times per task across the test run

Why default-to-frontier pricing is breaking down

The old assumption that the biggest model always wins is colliding with real bills.

For years the AI industry treated larger, more expensive models as the safe choice for nearly every task. Investor money subsidized that habit by keeping premium pricing low enough to use everywhere.

As subsidies fade and real costs surface, companies are looking harder at where they spend. If routine work shifts to low-cost options without hurting output, the math behind the whole field changes.

Brian Armstrong's 80 percent prediction for cheaper workloads

Coinbase's co-founder put a number on how far the shift might go.

›Armstrong forecast that 80 percent of AI workloads will run on models costing 99 percent less within 12 to 18 months.
›Only about 20 percent of tasks, in his view, will still need the newest top-tier systems for high-intelligence work.
›If routine jobs move to cheap models without losing quality, revenue assumptions for labs approaching public listings come under pressure.
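
To see what Armstrong's split would imply for a total bill, here is a rough back-of-the-envelope sketch in Python. It assumes every workload costs the same on a frontier model, which the prediction does not state, and the function name `blended_cost_fraction` is made up for illustration.

```python
def blended_cost_fraction(cheap_share: float, cheap_discount: float) -> float:
    """Return the fraction of an all-frontier bill left after routing.

    cheap_share: fraction of workloads moved to the cheap model (0 to 1).
    cheap_discount: how much cheaper that model is (0.99 means 99 percent less).
    Assumption: every workload costs the same on the frontier model.
    """
    if not 0.0 <= cheap_share <= 1.0 or not 0.0 <= cheap_discount <= 1.0:
        raise ValueError("cheap_share and cheap_discount must be between 0 and 1")
    cheap_part = cheap_share * (1.0 - cheap_discount)  # spend on cheap models
    frontier_part = 1.0 - cheap_share                  # spend left on frontier models
    return cheap_part + frontier_part


# Armstrong's figures: 80 percent of workloads on models costing 99 percent less.
fraction = blended_cost_fraction(cheap_share=0.80, cheap_discount=0.99)
print(f"Remaining spend: {fraction:.1%} of the all-frontier bill")
```

Under that equal-cost assumption, total spend falls to about a fifth of the all-frontier bill, and nearly all of what remains goes to the 20 percent of tasks still on top-tier systems.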

How Harvey paired an open-source worker with a frontier advisor

The legal AI company built a hybrid system rather than picking one model.

Working with inference platform Fireworks AI, Harvey put the open-source GLM 5.1 model at the center as the worker that handles most of the reasoning, drafting, and tool calls. Claude Opus 4.7 sits beside it as an advisor the worker calls only on the hardest sub-tasks.

This reframes the expensive frontier model from a load-bearing dependency into a specialist consulted when needed. Across the test run, the Opus advisor was invoked an average of 0.83 times per task, a sign of how sparingly the costly model gets used.
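
The worker-and-advisor pattern can be sketched in a few lines of Python. This is a toy illustration, not Harvey's or Fireworks AI's code: `run_hybrid`, the stub `worker` and `advisor` functions, and the `is_hard` check are all hypothetical stand-ins for real model calls and for whatever logic the worker uses to decide when to ask for help.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HybridResult:
    answer: str
    advisor_calls: int


def run_hybrid(
    subtasks: List[str],
    worker: Callable[[str, str], str],
    advisor: Callable[[str], str],
    is_hard: Callable[[str], bool],
) -> HybridResult:
    """Run every sub-task on the cheap worker; consult the advisor only when needed."""
    advisor_calls = 0
    parts = []
    for subtask in subtasks:
        hint = ""
        if is_hard(subtask):           # hypothetical escalation check
            hint = advisor(subtask)    # the expensive call, made sparingly
            advisor_calls += 1
        parts.append(worker(subtask, hint))  # the worker always writes the output
    return HybridResult(answer="\n".join(parts), advisor_calls=advisor_calls)


# Stub models so the sketch runs without any API access.
def worker(subtask: str, hint: str) -> str:
    return f"[worker] {subtask}" + (f" (advice: {hint})" if hint else "")


def advisor(subtask: str) -> str:
    return f"guidance on '{subtask}'"


def is_hard(subtask: str) -> bool:
    return subtask.startswith("HARD:")  # toy rule, purely illustrative


tasks = [
    ["summarize clause 4"],
    ["draft a cover note", "HARD: reconcile conflicting indemnity terms"],
]
results = [run_hybrid(t, worker, advisor, is_hard) for t in tasks]
mean_advisor_calls = sum(r.advisor_calls for r in results) / len(results)
print(f"Mean advisor calls per task: {mean_advisor_calls:.2f}")
```

Tracking advisor calls per task, as the last lines do, is the same kind of measurement as the 0.83 average reported for the test run, though the toy data here is unrelated to that figure.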

The benchmark numbers behind the 3x cost cut

Side-by-side testing shows the hybrid beating the standalone flagship on both score and price.

›GLM 5.1 alone passed 12 of 100 tasks on the legal benchmark slice.
›The GLM 5.1 plus Opus advisor hybrid passed 18 of 100 tasks at a total cost of 368 dollars.
›Standalone Claude Opus 4.7 passed only 14 of 100 tasks at a cost of 954 dollars.
›The hybrid scored higher at roughly 39 percent of the cost, a saving near 61 percent.
›Mean scores were close across systems: about 0.892 for GLM 5.1, 0.911 for Opus 4.7, and 0.892 for GPT-5.5.
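
The percentages follow directly from the reported totals, as this short Python check shows. The cost-per-passed-task figures at the end are derived here for illustration and do not appear in the reported results.

```python
# Reported benchmark totals: tasks passed out of 100 and total cost in dollars.
hybrid = {"passed": 18, "cost_usd": 368}
opus = {"passed": 14, "cost_usd": 954}

cost_ratio = hybrid["cost_usd"] / opus["cost_usd"]  # hybrid cost as a share of Opus cost
saving = 1 - cost_ratio                             # share of the Opus bill avoided
multiple = opus["cost_usd"] / hybrid["cost_usd"]    # how many times cheaper in total

# Derived figures, not reported in the article: dollars spent per passed task.
hybrid_per_pass = hybrid["cost_usd"] / hybrid["passed"]
opus_per_pass = opus["cost_usd"] / opus["passed"]

print(f"Hybrid cost as share of Opus: {cost_ratio:.1%}")
print(f"Saving: {saving:.1%}")
print(f"Total cost multiple: {multiple:.2f}x")
print(f"Cost per passed task: hybrid ${hybrid_per_pass:.2f}, Opus ${opus_per_pass:.2f}")
```

On total cost alone the gap on this slice works out to about 2.6x. Counting dollars per passed task widens it to a little over 3x, because the hybrid also passed more tasks.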

Why Harvey says quality still comes first

The firm frames this as a change in what quality means, not a retreat from it.

Harvey co-founder Gabe Pereyra said quality comes first, and in legal work it always will. His point is that the definition of quality is shifting.

The goal is moving from using the most capable model for everything to using the best model that reaches the right answer most efficiently. In a field where wrong answers carry real consequences, the hybrid had to match accuracy before cost mattered.
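
One way to read "quality first, then cost" is as a two-step selection: discard any setup that misses the quality bar, then pick the cheapest of what remains. The Python sketch below applies that reading to the reported benchmark figures. It assumes the standalone flagship's 14 passes is the bar to match, and `cheapest_passing` is a hypothetical helper, not something Harvey has described.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Setup:
    name: str
    passed: int                # tasks passed out of 100
    cost_usd: Optional[float]  # None where no cost was reported


def cheapest_passing(setups: List[Setup], min_passed: int) -> Optional[Setup]:
    """Apply the quality gate first, then compare cost among the survivors."""
    eligible = [
        s for s in setups
        if s.passed >= min_passed and s.cost_usd is not None  # unknown cost cannot be ranked
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s.cost_usd)


setups = [
    Setup("GLM 5.1 alone", passed=12, cost_usd=None),  # cost not reported
    Setup("GLM 5.1 + Opus advisor", passed=18, cost_usd=368),
    Setup("Claude Opus 4.7 alone", passed=14, cost_usd=954),
]

# Assumed quality bar: at least as many passes as the standalone flagship.
choice = cheapest_passing(setups, min_passed=14)
print(choice.name if choice else "no setup meets the bar")
```

The worker model on its own drops out at the quality gate before price is ever considered, which matches the order of priorities Pereyra describes.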

What the shift means for OpenAI, Anthropic, and the IPO math

Cheaper routing puts pressure on the labs selling the priciest models.

Named players in the cost debate include OpenAI with GPT-5.5 and GPT-5.4-mini, Anthropic with Claude Opus, and DeepSeek with V4 Flash. The savings often come from stepping down a tier within one vendor's lineup or moving to open-weight alternatives, not only from choosing open source over proprietary models.

OpenAI and Anthropic are both moving toward public listings, and their financial outlook leans on heavy use of premium models. If enterprise buyers weigh efficiency on every job rather than reaching for the flagship, that assumption weakens.

Frequently Asked Questions

What does Harvey's 3x cost reduction actually mean?

Harvey cut its inference costs to roughly a third of what running everything on a top-tier model would cost. It did this by handling most work with a cheaper open-source model and calling the expensive model only for the hardest tasks, with no drop in quality.

Does using a cheaper model lower the quality of the output?

In Harvey's testing it did not. The hybrid setup passed 18 of 100 benchmark tasks versus 14 for the standalone flagship, so it scored higher while costing far less.

What is Brian Armstrong predicting about AI models?

The Coinbase co-founder predicts that 80 percent of AI workloads will run on models costing 99 percent less within 12 to 18 months, leaving only about 20 percent for the newest top-tier systems.

Why does this matter for OpenAI and Anthropic?

Both companies are heading toward public listings, and their revenue outlook depends on heavy use of premium models. If buyers route most work to cheaper options, that assumption comes under pressure.

What is a hybrid model approach?

It is a system where a cheaper model does the bulk of the work and a more expensive model is called in only for the most difficult parts. This keeps overall quality high while sharply reducing cost.

The clearest signal from Harvey is that smart routing between cheap and premium models can match quality at a fraction of the cost. If that pattern spreads, the economics of AI shift from default-to-flagship toward paying for the most capable model only when a task truly needs it.

Why It Matters for Business

Real business deployments are the most reliable signal of where AI is generating measurable ROI. Watching which sectors operationalize AI, what they pay for it, and how it changes their P&L tells you more than any vendor demo. These case studies are what serious buyers and investors triangulate on.