Your AI Strategy Needs a CFO, Not a Fan Club

March 4, 2026

The model you choose today is a liability tomorrow.

Most companies are currently making the single most expensive mistake in their AI journey: they are treating LLM selection like a popularity contest. They choose the “celebrity” models, the ones with the loudest CEOs and the most tabloid-style hype, and they hard-code them into their business processes. This is not a strategy. It is a path to technical debt and a guaranteed ROI failure.

In our recent testing of 28 different models against thousands of business records, we found that the gap between a $1,500-a-day bill and a $15-a-day bill is often just a single, dynamic decision.

The Danger of Hard-Coded Loyalty

Developers naturally gravitate toward what is popular. They pick a celebrity model, hard-code the APIs, and begin building. But this creates a catch-22: APIs vary wildly across vendors, so the moment a model is deprecated, the business faces a massive development project.

Hard-coding your AI to a single vendor sacrifices your agility. Instead of solving business problems, your engineering team spends its time chasing API changes and performing full test cycles just to keep the lights on. A true agentic platform must be plug-and-play. It must hide the intricacies of the handshake so you can swap models as the market evolves.
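The plug-and-play idea can be pictured as a thin adapter layer between your workflows and each vendor's API. The sketch below is illustrative Python with hypothetical vendor names and classes, not Krista's actual implementation; the point is that swapping models becomes a one-line configuration change rather than a rewrite.

```python
from abc import ABC, abstractmethod

class ModelAdapter(ABC):
    """Uniform interface that hides each vendor's API handshake."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class VendorAAdapter(ModelAdapter):
    def complete(self, prompt: str) -> str:
        # A real adapter would translate to vendor A's request format here.
        return f"[vendor-a] {prompt[:20]}"

class VendorBAdapter(ModelAdapter):
    def complete(self, prompt: str) -> str:
        # Vendor B's wire format differs, but callers never see that.
        return f"[vendor-b] {prompt[:20]}"

# Swapping vendors is a config change, not a development project:
ACTIVE_MODEL: ModelAdapter = VendorBAdapter()
print(ACTIVE_MODEL.complete("Extract the shipping address from this email."))
```

Business logic calls only `ModelAdapter`, so a deprecated model means replacing one adapter, not re-testing every workflow.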

Benchmarks vs. Business Reality

The industry is obsessed with “Humanities Exam” winners. We see benchmarks showing how well a model performs on an SAT or a bar exam. This is irrelevant to businesses.

You do not need a trillion-parameter model that speaks seven languages to extract an address from an email buffer. When you use the most expensive model for menial tasks, you are hiring a PhD to do data entry. It isn’t just inefficient; it’s a “salary problem” for your AI budget.

A business-ready model selection must be based on four pillars:

  • Speed: How fast does the outcome occur?
  • Accuracy: Is the result within the acceptable threshold for this specific task?
  • Cost: Does the token consumption justify the value?
  • Risk: Is the data secure, and is the model family reliable?

The “Good Enough” Threshold

There is a pervasive myth that AI must be 100% accurate. This is an unrealistic standard that humans themselves do not meet. In a business context, “good enough” is a mathematical reality. If your current manual process has an 85% accuracy rate, and an LLM delivers 90% at one-tenth the cost, you have won.

“You have to be thoughtful about your use of GenAI just like you are thoughtful about how you deploy all the talents in your organization.” — John Michelsen

We found that celebrity models are often 15 times more expensive and six times slower than specialized alternatives that offer nearly identical quality. For a cognitive enterprise, the goal is not “the best model.” The goal is the most effective orchestration of memory, reasoning, and action at the best cost.

From Chat to Outcomes

A chat window is not a business application. Extracting data from thousands of invoices or processing warranty returns requires a different skill set than “homework cheating.”

Our testing revealed that “bigger” is rarely “better” for specific tasks:

  • RAG (Retrieval-Augmented Generation): Needs models that can sift through thousands of pages to find the five significant paragraphs that matter.
  • Document Understanding: Requires detecting if a document is a return or a warranty claim and creating structured JSON for a workflow engine.
  • Classification: Using traditional machine learning as a “Conductor” to decide if a query is simple or complex before even involving a generative model.
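A minimal sketch of the "Conductor" idea follows. Keyword rules stand in for the trained classifier a production system would use, and the route names and keywords are purely illustrative, not Krista's actual logic; the point is that routing happens before any generative model is invoked.

```python
def conduct(query: str) -> str:
    """Toy conductor: decide where a query goes before spending LLM tokens.
    A real system would use a trained classifier; these rules are illustrative."""
    q = query.lower()
    if any(word in q for word in ("return", "refund", "warranty")):
        return "workflow:returns"   # structured workflow, small cheap model
    if "policy" in q or len(q.split()) > 50:
        return "rag"                # retrieval + generation over documents
    return "small-model"            # simple query, cheapest viable model

print(conduct("I want to return my order"))           # workflow:returns
print(conduct("What is the parental leave policy?"))  # rag
```

Because the conductor is cheap to run, every query pays the classification cost once instead of paying celebrity-model rates by default.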

The $1,500 Chatbot Disaster

We recently audited a company using a celebrity model for a standard chatbot. Their bill was $1,500 a day. After moving to Krista and allowing Krista to dynamically route work, the cost dropped to $15 a day.

The original bot would have needed to generate $500,000 in value annually just to pay its own AI bill. Most companies don’t realize they are being overcharged until they receive the first huge bill. If you only use one model, your bill is guaranteed to be too high.
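The break-even figure above follows from simple arithmetic on the audited daily bills:

```python
daily_bill = 1_500                 # USD/day, from the audit above
annual_cost = daily_bill * 365
print(annual_cost)                 # 547500 -- over $500,000/year in model fees alone

routed_daily = 15                  # USD/day after dynamic routing
print(routed_daily * 365)          # 5475 -- a 100x reduction
```
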

A Continuous Exercise

The AI and LLM rankings you see today will be the wrong answer in three weeks. This is a continuous exercise. You cannot set your AI strategy in stone. You must adopt a platform that allows for “hot-swapping” models as skills, efficiency, and costs shift.

If you are not paying for the product, you are the product. OpenAI’s $12 billion loss in a single quarter is a signal that the current “celebrity pricing” is unsustainable. You must insulate your business from these shifts by platforming your agents now.

Action Plan for Leaders

  • Audit your current AI bills: If you have one vendor, you are losing money.
  • Stop the Hard-Coding: Demand that your IT teams use an abstraction layer so you aren’t tied to a model family.
  • Optimize for Outcomes, Not SAT Scores: Test your models against your actual business records, not humanities tests.
  • Deploy a Conductor: Use a platform like Krista to handle the dynamic switching between models based on speed, cost, and risk.

The companies that win will not be the ones with the best model. They will be the ones with the best orchestration.

Speakers

Scott King

Chief Marketer @ Krista

John Michelsen

Chief Geek @ Krista

Chris Kraus

VP Product @ Krista

Transcription

Scott King

Well hey everyone, thanks for joining this episode of The Union Podcast. We’re back from a break, but it was well worth it since we got some really cool LLM testing results to walk you through today. Joined by the usual, John and Chris. Hey guys, how’s it going?

Chris Kraus

Hey Scott.

John Michelsen

Very well.

Scott King

John, I like your shirt today. It’s very on brand. Over the past couple of months, we tested 28 different LLM models, both closed source and open source, against all types of records. The title of the report is “Breaking the Celebrity LLM Monopoly.” John, walk me through why you call it the celebrity and why do we feel it’s necessary to call out this decision-making based on a popularity contest.

John Michelsen

Thanks Scott. It really is an incredibly important decision that businesses are making. It has a lot to do with whether they’re going to ROI these projects. Yet it is a popularity contest. A lot of what we hear these days about the most popular, most commonly known models is more tabloid sounding than science or tech sounding. I’m living in a world I’m not used to. I used to turn off all the noise that comes out of Hollywood. Now I’ve got to study it. That irks me. That’s not what we should be doing right now. When you think about the behavior of the CEOs of two of these biggest ones recently, not willing to shake hands in front of the prime minister of a huge country, you realize there’s a lot more immaturity in these organizations.

This is a very critical time. These are very valuable pieces of technology doing incredible things. But the noise from the celebrity level of hype is drowning out the true story about how reasonably priced, more reasonably built models are doing phenomenally good work. We’ve got to embrace the entire network of models and their capabilities as we try to use these in a business context and make an ROI work.

Scott King

Picking one is dangerous, especially if it turns out to be too costly. From a technical perspective, Chris, when a company defaults to one celebrity model, what are they actually sacrificing?

Chris Kraus

We’ve discussed this probably two or three times over the last couple of years. There’s a natural affinity where a developer picks a model, hard codes APIs, and starts using it. The catch-22 is the APIs for different large language models are different. If you’ve hard coded them, once one becomes deprecated, it’s a code change to fix them. If you want to change to a different model because it is more efficient or your current one is not providing better answers, it’s a development project. You rewrite your code and go through full test cycles.

We’ve always talked about the value of Krista being plug and play with multiple models. We hide the intricacies of how you handshake and interact with the model so you can change them. Cost is becoming a big burden for people. Newer models are just as powerful as the first models we got two years ago. They’re doing a much better job but they’re a lot less expensive. Models are being deprecated on a regular basis. When they’re hard-coded, everything has to change. It prevents people from thinking about how to solve the problem because they are chasing API changes in the background. It’s moved to a technical level versus the business level.

Scott King

It’s very anti-Henry Ford. John, benchmarks are often outside of a real business context. How well does it perform on an SAT? Why aren’t those benchmarks a good measure for business use?

John Michelsen

They may be a good barometer for a very small percentage of use cases, but they’re not the necessary skill level for every activity. We’re running around trying to get the absolutely best humanities exam winner and then we’re asking it to get an address out of an email buffer. We don’t need that. We need to think about GenAI models the same way we do employee capabilities. We employ a variety of skill levels at a variety of costs to form the business with the proper set of capabilities. We cannot hire the most expensive people and give them the most menial tasks. They will refuse to do the job and we won’t operate effectively.

That’s exactly what’s happening. You grab the most expensive celebrity model. A developer thinks they want all the good luck they can get, so they take the most expensive. That’s not how you deliver ROI. Every single thing that becomes a prompt to an LLM gets evaluated on speed, accuracy, cost, and risk. Only then do we pick a model. Just recently, one celebrity model claimed it is more capable than human capacity. I was doing a little marketing work and needed a word that starts with the letter C that means fast. The model said “quick.” So when we hear suspicions that these guys are juicing the tests, there could be a lot going on there. It’s not just a handful of celebrity model makers doing the work. It’s an entire group that deserves focus because they’re doing work that we can actually ROI.

Scott King

Building an application extracting data from invoices is something nobody’s going to do via a chat window. Chris, we ran thousands of business records through these 28 LLMs. What were the test records and why did we do it that way?

Chris Kraus

We took our experience deploying into customers. Our customers ask us to automate end-to-end business processes like order processing or customer service. We looked at different use cases like data processing. Can the model determine a date and look at sentiment? Then we get into cases where the workflow changes based on what’s to be done. Is it a return or an RMA because of a warranty issue? Can we detect those things and create proper JSON documents for a workflow engine? It’s different than trying to rewrite a sentence to cheat on homework.

We found you need a different skill set with Retrieval-Augmented Generation (RAG). If people need to look up an HR policy among 15 manuals, it’s a combination of retrieval and generation. Summarizing and answering based on five significant paragraphs is a different skill than document understanding. Bigger definitely is not better. Sometimes the bigger model is slower because it speaks seven languages. If you only need one, you’re not doing yourself any favors going to a trillion-parameter model. The bigger model is not necessarily more accurate. Sometimes it’s the skill of the model. That’s what we look to identify for the customers to make sure we’re giving accurate, fast answers.

Scott King

John, can you walk us through the idea of “good enough”? Customers ask for 100% accuracy. What is the difference between 100% and good enough in cost and skill?

John Michelsen

In a genuine business context, you never have 100% certainty of anything. Your team doesn’t have 100% output. I have yet to work with a customer who said it has to be perfect before we can ship it. We always have an awkward moment of education. Have you evaluated the accuracy of the way you currently do it now? Those capable of doing so usually find even an initial deployment is meeting or beating that metric. It gets better from there because there are solution-level capabilities that use a continuous learning process.

Deep within the Krista platform, there is a notion of a prompt profile. When we invoke a model, speed, cost, accuracy, and risk are taken into account. We optimize on accuracy versus cost or performance. If you’re doing a product catalog chatbot, accuracy may not even be in your documents. How precise does it have to be when you’ve prompted it to sell your stuff? You might need speed more than accuracy. It needs to be a dynamic decision. Most are not doing this.

OpenAI lost 12 billion dollars in Q3 of 2025. When you think about that kind of cash disappearing, they’re either praying for a cheap GPU or they expect you to pay a significantly higher amount later. You are the product if you are not paying for it. OpenAI needs to crowd out everyone else, become the most popular celebrity, and own the market so they can write the economics. We need to be focused on where other models can do the workload. Monopolies are not an effective economic model for consumers.

Scott King

In the report, I took a celebrity model and an open-source one. The quality was tenths of percentage points difference, but the celebrity model was 15 times more expensive and six times slower. How do you dynamically switch these things?

John Michelsen

You provide a framework of technology access to many of them. In Krista’s case, we do most of the prompt engineering from the software. We don’t say “here’s a big white box, see if you can get what you need.” Our mission is to make this a non-issue for business people. If the solution knows what it’s trying to accomplish, it constructs that prompt profile.

Scott King

John, walk us through the story of the $1,500 chatbot.

John Michelsen

This is a true story. We discovered a chatbot in active use with over $1,500 a day in LLM charges. A few weeks later, that’s a Krista deployment with less than $15 a day of LLM charges. That chatbot would have to clear over $500,000 of value just to cover the third-party AI cost. You’re not going to ROI many projects that way.

Scott King

Chris, if someone has a big AI bill, what is the first audit they should do?

Chris Kraus

If you’re using just one, you’re doing it wrong. You need five or six. Within Krista, we use a classifier—a traditional machine learning model—to classify if this is a complex query, a math problem, or data extraction. If you only have one bill, it is too high because you’re not shopping correctly.

John Michelsen

If I tried to one-line it: you have to be thoughtful about your use of GenAI just like you are thoughtful about how you deploy all the talents in your organization.

Chris Kraus

Whatever you figured out today is the wrong answer in three weeks. This is a continuous exercise. We go through this once a month. Rankings change because everybody’s getting better at certain skills.

Scott King

Appreciate it, guys. Until next time.