A lot of attributes of large language models (LLMs) like ChatGPT or Claude make people nervous, especially when thinking about using them in contexts like hiring or analyzing medical records.
I'm not trying to convince you these concerns aren't valid. In fact, I think some of these should probably worry you more!
But there's one characteristic of LLMs that gets more attention than it deserves, and I often hear it cited as a reason we shouldn't let these models near anything important: non-determinism. Non-determinism refers to the fact that if you ask an LLM like GPT-4 or Claude the exact same question multiple times, you might get different answers each time. Sometimes the answers might only vary slightly, but other times they could be completely different.
Why Non-Determinism Makes People Uncomfortable
I think non-determinism gets under people's skin for several reasons:
It raises fairness concerns. If I apply for a loan on Monday and you apply with identical information on Tuesday, we expect the same outcome. The idea that an AI might approve me and reject you feels fundamentally unfair.
It makes testing seem impossible. How do you know if your system is working "correctly" if it might do something different each time you run the same test?
It violates our basic expectations of computers – they're supposed to be reliable and consistent! When a calculator gives different answers to the same equation, we don't think it's being creative, we think it's broken.
I've sometimes given a somewhat flip answer to this concern: "Well, are humans deterministic?" Think about all the important processes we trust to people—screening job applications, making medical diagnoses, or deciding legal sentences. The outcomes of these processes depend not just on which HR person, doctor, or judge you happen to get, but also on factors like whether they've just had a meal break or how long they've been on shift.
But that's not satisfying. Most other machine learning models are consistent—they will give you the exact same output every time you give them the same input. And expecting more consistency from computers than people doesn't seem unreasonable.
I think this deserves a better answer. So here's what I'll cover in this post:
The Gumball Machine: A Mental Model for LLM Outputs
Non-Determinism Shouldn't Stop Us From Using LLMs in Important Contexts
LLM Qualities That Concern Me More
The Gumball Machine: A Mental Model for LLM Outputs
When you ask an LLM a particular question, it's like you've got a gumball machine and you can't see what's inside. You put in your coin, turn the handle, and something comes out.
Let's use a simple example: imagine you're asking a classification question – one where the model must select from a limited set of options like "yes" or "no." In this case, there are only two types of gumballs in our machine, the yes ones and the no ones – and in this scenario, 70% of the gumballs say "yes" and 30% say "no."
This doesn't mean "yes" is 70% likely to be the correct answer—it doesn't necessarily tell us anything about correctness! It just means that if you keep asking the same question over and over, about 70% of your answers will be "yes."
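To make the mental model concrete, here's a toy simulation in Python (no LLM involved, just a coin flip standing in for the machine's hidden 70/30 mix):

```python
import random

def turn_handle(p_yes=0.7):
    """Draw one 'gumball' from a machine that is 70% 'yes' and 30% 'no'."""
    return "yes" if random.random() < p_yes else "no"

# "Ask the same question" 1,000 times and tally the answers.
answers = [turn_handle() for _ in range(1000)]
print(answers.count("yes") / len(answers))  # roughly 0.7, though it varies run to run
```

Any single pull is unpredictable, but the mix of answers over many pulls is quite stable—that's the sense in which the machine's behavior is random yet bounded.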
What about for questions that are generative in nature – that is, where the model needs to create new content rather than choose from predetermined categories? In that case, there might be a much wider set of possible gumballs to select from. But it's still 'bounded' – there's never an infinite set of unique gumballs in the machine. For instance, if you ask repeatedly for a list of pasta recipe ingredients, you'll never get "arsenic" or even "cheetos".
In fact, for some questions, there might actually just be one type of gumball–try asking some models, over and over again, to fill in a word to complete the phrase “better late than __”, and you'll get the same answer, over and over.

Gumball machines aren't a perfect comparison, because with real gumball machines, once you take out a gumball, it's gone. But with LLMs, it's like the machine never runs out—each answer is always available to be selected again with the same probabilities as initially.
This randomness isn't typical in machine learning[1]. Many models give you probabilities (like "70% chance of rain") or other measurements of uncertainty, but they'll give you the same prediction every time with the same input if you want it. Put your coin in, get the same answer out each time.
So this variability is a feature of LLMs that you're unlikely to run into with other text analytics tools. Seems bad! But I don't think it is, and I don't think we should be wary of using them for important things—at least not for this reason. Here's why.
Non-Determinism Shouldn't Stop Us From Using LLMs in Important Contexts

Flagging Text Is Different From Making Decisions
When we talk about using LLMs in the context of hiring or analyzing medical records, we're often not talking about letting the model make a decision. It's not actually choosing a candidate or determining a mortgage rate. Instead, we're using it to flag important information about a broader document or set of documents.
For example, the LLM might:
Highlight relevant technical experience in a job application so a human reviewer can spot it more easily
Suggest the presence of billing fraud and point out parts of a medical record that support this
In both cases, a human is still involved in the process—in the first example, the LLM isn't even making recommendations, just helping surface information.
The standards we set for LLMs in these supporting roles should be different from what we'd require if the LLM were making decisions on its own. When an LLM is highlighting text rather than making judgments, we care more about whether useful information is surfaced than whether the exact same information is highlighted every time. A human reviewer can recognize relevant information regardless of slight variations in what gets flagged: that doesn't seem unfair or like it undermines the process in the same way that it does for decision-making.
Being Consistent Isn't the Same as Being Right
Going back to our yes-or-no gumball example, it would be incredibly easy to build a system that's perfectly consistent: just program it to always say "no" to every question. But that wouldn't be very useful!
We prefer systems that give correct answers most of the time, even if they're slightly inconsistent. Accuracy and consistency are different qualities. The reason to use an LLM instead of some other model is that it does something better—it might be more accurate or more affordable. It might be great at handling routine questions quickly so people can focus on more complex issues. These qualities can, in different scenarios, trade off to varying degrees with non-determinism.
That is, even though we generally prefer more consistency rather than less, we might accept some variability if it means getting significantly more accurate responses.
We Can Make Systems Using LLMs More Consistent When Needed
If you've only used LLMs through graphical interfaces like the ChatGPT website, you might think these systems are more inconsistent than they actually are when used in narrow, specific applications.
This is because, when consistency is important, developers can adjust settings to make the LLM more predictable. One of these settings is temperature, which controls how variable the LLM's responses are. Using our gumball machine analogy, temperature is like a dial that changes how the selection works. At high temperature settings, even the rare gumballs have a real chance of being selected. At lower temperatures, the machine becomes more and more likely to hand you the most common gumball every time.
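As a rough sketch of what that looks like in practice, here's how you might set temperature when calling a hosted model through the OpenAI Python client (the model name and prompt are just placeholders, and other providers expose a similar knob):

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is already configured in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user",
               "content": "Does this record mention a billing code? Answer yes or no."}],
    temperature=0,  # 0 strongly favors the most likely answer; higher values add variability
)
print(response.choices[0].message.content)
```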
But for more control, especially with open-source LLMs (models whose code and weights are publicly available and can be run on an organization's own computers, unlike proprietary models like GPT-4), developers can also:
Set a random seed (think of this as pre-determining which gumball will get picked first, second, etc.)
Make specific hardware choices that affect how calculations are performed.
Work directly with the LLM's raw prediction scores (called logits)—these are the model's internal confidence ratings for each possible answer[2] (sketched below).
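Here's a rough sketch of what a couple of these levers look like with an open-source model via the Hugging Face transformers library; the model name is just an example, and greedy decoding stands in for "always pick the most common gumball":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(42)  # fix the random seed so any sampling is repeatable

model_name = "gpt2"  # example model; swap in whatever open model you're running
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Better late than"
inputs = tokenizer(prompt, return_tensors="pt")

# Option 1: greedy decoding -- no sampling at all, so the completion is deterministic
output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Option 2: inspect the raw logits and take the single most likely next token yourself
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for every token in the vocabulary
print(tokenizer.decode([logits.argmax().item()]))
```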
Another technique for classification problems is to ask the LLM the same question multiple times and go with the most common answer. This might get more accuracy, or less—you can test it for your problem and see—but it will get you more consistency.
And for tasks like flagging which text to highlight, you can ask the same question multiple times and flag all of the text that has ever been returned, or just the text that always gets returned, for more consistency.
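A minimal sketch of both ideas, assuming hypothetical classify() and flag_passages() helpers that wrap whatever LLM call you're actually making:

```python
from collections import Counter

def majority_vote(classify, question, n=5):
    """Ask the same classification question n times and keep the most common answer.
    `classify` is a hypothetical function wrapping your LLM call."""
    answers = [classify(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def combine_flags(flag_passages, document, n=5, mode="union"):
    """Run a hypothetical flagging call n times and combine the results.
    'union' keeps anything that was ever flagged; 'intersection' keeps only
    the passages flagged on every run."""
    runs = [set(flag_passages(document)) for _ in range(n)]
    return set.union(*runs) if mode == "union" else set.intersection(*runs)
```

Whether union or intersection is the right choice depends on whether missing a relevant passage or surfacing an irrelevant one is the more costly mistake for your application.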
LLM Qualities That Concern Me More
These are the things that actually bother me about deploying LLMs in important settings—or at least explain why I think they need a lot of testing and monitoring.
You Can't Look Under the Hood
With many traditional machine learning systems, you can trace why they made a particular decision. For example, a loan approval system might show that a high income pushed toward approval to a certain degree, while a recent late payment pushed toward rejection.
But with large language models, we don't have this visibility or explainability yet[3]. We can't point to specific parts of the model and say, "This is why the model predicted this patient needs this treatment."
How do you audit a decision when you can't fully explain how it was made? How do you know if you've successfully tweaked it to make better decisions in the future, particularly if the input is slightly different next time?
This lack of transparency makes it especially tricky to use LLMs in production systems where we can't predict what data people will input. Since we don't understand their internal processing, it's hard to anticipate how small changes in inputs might produce dramatically different outputs. And while this problem of handling novel data isn't unique to LLMs, their opacity makes it particularly hard to predict which features the model will treat as important when generating its response, and therefore to tell whether the new data you're getting differs from the old data in ways that matter.
There are answers to this—mostly involving a lot of testing—but they're not obvious or easy, and this problem is also worse than with many other types of models.
How Do You Measure "Right" When There's No Single Right Answer?
A lot of the use cases for LLMs are generative tasks where the model creates new content rather than selecting from predefined options.
With traditional models, automated scoring is super straightforward—you know exactly what the right answer should be and can compare the model's output against it. But with generative LLM tasks, automated evaluation becomes challenging because there's no simple way to automatically determine if a response is "correct".
Instead, you end up needing humans to judge quality, which is expensive and time-consuming. Or you use other LLMs to evaluate responses, which can be very effective and is best practice for certain types of evaluations, but also adds another layer of models to evaluate.
We Can Deal With Non-Determinism. Really.
When I talk to people implementing LLMs in real systems, non-determinism rarely comes up. They're worried about issues like dependency on models owned by external organizations and the challenge of scaling demos to handle more varied, real-world data. And when I’m building something with these, my first concern is accuracy—because if we can't get it accurate enough, the rest of it doesn't matter.
Let's go back to the objections I opened with:
Fairness: Many real-world uses aren't about making final decisions—they're highlighting text for review, triaging documents, or making a guess but ultimately asking you to make the decision. When humans still review and make the choice, the stakes are different, and so are the consistency standards we need.
Testing: Non-determinism complicates testing, but not by that much. The bigger testing headaches come from lack of explainability and the difficulty of evaluating whether generated text is "right."
Computers shouldn’t work that way: Maybe not! And it would be better if we could make them deterministic much more easily. But for now, this is the trade-off we're making to get systems that can handle language at this level.
But also, when we do need to make LLMs more deterministic, we have ways to do it—temperature at zero, fixed seeds, repeated sampling. The non-determinism might actually be tiny or non-existent in practice for your specific problem and set-up. The non-determinism issue for LLMs does not always need to be solved in order to use these in applications that matter—but when it does, we can (mostly) do it.
[1] For certain types of models, there is some randomness that is less central to the model and easy to control. There are other models with additional sources of non-determinism, but most data scientists will never encounter them.
[2] In our gumball example, instead of just getting a "yes" or "no," you'd see that the model thinks there's a 70% chance of "yes" and 30% chance of "no." This is also possible with proprietary models, but it's not quite as effective, because the raw prediction scores for proprietary models are also non-deterministic.
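As an illustration, here's a hedged sketch of requesting those per-token probabilities from a hosted model through the OpenAI Python client (the model name is a placeholder, and parameter names vary across providers and SDK versions):

```python
import math
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user",
               "content": "Answer yes or no: is this invoice a duplicate?"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=2,  # return the two highest-probability candidate tokens
)
# Convert log probabilities back into ordinary probabilities for the first token.
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(candidate.token, math.exp(candidate.logprob))  # e.g. "yes" ~0.7, "no" ~0.3
```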