There are no hallucinations. Only probabilities.
What we call a "hallucination" is not a glitch, a lie, or a failure of reasoning. It is the mathematically inevitable consequence of sampling from a probability distribution.
Every time a language model generates text, it is not “thinking” in any meaningful sense. It is computing a probability distribution over its entire vocabulary, tens of thousands of tokens, and then sampling from that distribution. The word “Madrid” doesn’t come out because the model “knows” it’s the capital of Spain. It comes out because, given all the preceding tokens, the model assigned it a very high probability.
This distinction matters enormously. When a model says something false, when it invents a citation, misremembers a date, or confabulates a biography, it is not lying. It is doing exactly what it was designed to do: pick the statistically likely next token.
The problem is that statistical likelihood and factual truth are not the same thing.
The AI industry has decided to call this gap a “hallucination.”
That word choice is not neutral. It implies a failure of perception, a momentary lapse, something that happens to an otherwise sound mind. It is the perfect term for an industry that wants to acknowledge the problem just enough to seem responsible, while obscuring what’s actually going on.
Welcome to the inside of the world’s most elaborate magic trick, which is really just numbers.
The softmax: where all tokens compete
Before a model produces any word, it converts raw logit scores into a probability distribution using the softmax function. Every token in the vocabulary gets a probability, and those probabilities sum to exactly 1. Nothing is ever ruled out entirely; some tokens are just astronomically unlikely.
# Softmax: converts raw scores → probabilities
# P(token_i) = exp(z_i) / Σ exp(z_j) for all j in vocabulary
import numpy as np
def softmax(logits):
    e = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return e / e.sum()
logits = np.array([4.2, 2.1, 1.8, 0.5, -0.3])
probs = softmax(logits)
# → [0.801, 0.098, 0.073, 0.020, 0.009] ← sums to ~1.0
# When sampling at temperature 1, the top token wins about 80% of the time.
# But about 20% of the time, something else comes out. Every single generation.
This is the foundation. The model never decides anything. It draws from a distribution. On factual questions where the training data is rich and consistent, the correct answer sits at 0.95+ probability and almost always wins. On questions at the edge of its training, such as obscure facts, recent events, and niche domains, the distribution flattens, and the gap between the right answer and a plausible-sounding wrong answer narrows to nothing.
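To see what drawing from a distribution means in practice, the sketch below samples repeatedly from the five-token example above and counts how often each token comes out. The token labels and the `sample_counts` helper are made up for illustration, and the block sticks to the standard library so that it runs on its own; a real decoder does the same thing over the full vocabulary.

```python
import math
import random
from collections import Counter

def softmax(logits):
    """Convert raw scores to probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_counts(probs, labels, n_draws, seed=0):
    """Draw n_draws tokens from probs and count how often each label appears."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    draws = rng.choices(labels, weights=probs, k=n_draws)
    return Counter(draws)

# Hypothetical token labels for the example logits used in the text.
labels = ["A", "B", "C", "D", "E"]
probs = softmax([4.2, 2.1, 1.8, 0.5, -0.3])

counts = sample_counts(probs, labels, n_draws=10_000)
for label, p in zip(labels, probs):
    print(f"{label}: p={p:.3f}  observed={counts[label] / 10_000:.3f}")
```

The observed frequencies track the probabilities closely, which is the point: the lower-ranked tokens are not errors in the sampler, they are the sampler working as specified.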
Entropy: measuring what the model doesn’t know
Information theory gives us a precise way to measure how uncertain a distribution is. Shannon entropy quantifies exactly how “spread out” the probability mass is and therefore how likely the model is to produce something unexpected.
# Shannon entropy of a discrete distribution
# H(P) = -Σ P(x) · log₂ P(x) for all x with P(x) > 0
def entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]  # avoid log(0)
    return -np.sum(probs * np.log2(probs))
# A deterministic distribution (the model is certain):
entropy([1.0, 0.0, 0.0])  # → 0.0 bits (no uncertainty)
# A fully uniform distribution (the model has no idea):
entropy([0.25, 0.25, 0.25, 0.25])  # → 2.0 bits (the maximum for four outcomes)
# A realistic model output on a well-known fact:
entropy([0.72, 0.10, 0.08, 0.06, 0.04])  # → 1.39 bits
High entropy doesn’t mean the model is broken. It means the model is genuinely uncertain, which is often the correct epistemic state. The problem arises when a model generates confident-sounding text despite high entropy in its internal distribution. Think about all those research papers that look right but are nowhere near reality.
That gap between expressed certainty and internal uncertainty is where hallucinations live.
And here is the first thing the industry doesn’t want to say plainly: the model always has the information it needs to flag its own uncertainty. It just isn’t trained to use it.
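As a rough illustration of what using that information could look like, the sketch below computes the entropy of each generation step and flags the steps that exceed a threshold. The per-step distributions, the `flag_uncertain_steps` name, and the 1.0-bit threshold are all invented for the example. Token-level entropy is only a proxy for factual uncertainty, and a usable threshold would have to be tuned empirically.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def flag_uncertain_steps(step_distributions, threshold_bits=1.0):
    """Return (index, entropy) for every step whose entropy exceeds the threshold.

    threshold_bits is an arbitrary illustrative value, not a recommended setting.
    """
    flagged = []
    for i, probs in enumerate(step_distributions):
        h = entropy_bits(probs)
        if h > threshold_bits:
            flagged.append((i, h))
    return flagged

# Made-up next-token distributions for three generation steps.
steps = [
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked: low entropy
    [0.40, 0.30, 0.20, 0.10],  # spread out: high entropy
    [0.90, 0.05, 0.03, 0.02],  # fairly peaked
]

for index, h in flag_uncertain_steps(steps):
    print(f"step {index}: entropy {h:.2f} bits exceeds threshold")
```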
The chain rule: how small errors become confident nonsense
Language models generate text one token at a time, each conditioned on all prior tokens. This means probabilities multiply. A sequence of individually plausible tokens can produce a collectively improbable and factually wrong sentence.
I usually get a hard time when I call any LLM a “stochastic plausibility engine”, but that’s exactly what it is.
# Joint probability of a sequence via the chain rule:
# P(w₁, w₂, ..., wₙ) = P(w₁) · P(w₂|w₁) · P(w₃|w₁,w₂) · ... · P(wₙ|w₁,...,wₙ₋₁)
import math
# Simplifying assumption: each token is independently "correct" with a fixed probability.
# If each token has 90% "correct" probability over 10 tokens:
token_probs = [0.90] * 10
joint_prob = math.prod(token_probs)
# → 0.349 (only 35% chance the full sentence is "correct")
# At 80% per token over 20 tokens:
joint_prob_20 = 0.80 ** 20
# → 0.012 (1.2% chance; almost certainly contains an error)
# At a seemingly-generous 95% per token over 200 tokens:
joint_prob_200 = 0.95 ** 200
# → 0.000035 (about 0.004%; virtually guaranteed to contain an error)
This is why longer answers are more likely to contain errors. It is not sloppiness. It is arithmetic. Every sentence you add to a response is another multiplication by a number less than one. For any per-token accuracy below 1, the joint probability of a perfectly accurate paragraph shrinks exponentially with its length; at 95% per token, a 200-token paragraph is all but certain to contain an error.
The industry presents this as a problem of model capability, something that will be solved with more parameters, more data, more compute. It will not. The chain rule is not a training artifact; it is mathematics.
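The same arithmetic can be run in the other direction. Under the simplifying assumption used above, that each token is independently "correct" with a fixed probability, the sketch below computes the chance of an error-free sequence and the length at which that chance falls below one half. The function names are illustrative, and real per-token errors are neither independent nor constant, so the outputs are an aid to intuition rather than a measurement.

```python
import math

def error_free_probability(per_token_accuracy, n_tokens):
    """P(no error in n_tokens), assuming independent, identical per-token accuracy.

    Requires 0 < per_token_accuracy <= 1.
    """
    # Work in log space so very long sequences do not lose precision.
    return math.exp(n_tokens * math.log(per_token_accuracy))

def tokens_until_half(per_token_accuracy):
    """Smallest length at which the error-free probability drops below 0.5.

    Requires 0 < per_token_accuracy < 1.
    """
    return math.floor(math.log(0.5) / math.log(per_token_accuracy)) + 1

for acc in (0.90, 0.95, 0.99, 0.999):
    print(
        f"accuracy {acc}: "
        f"P(200 tokens clean) = {error_free_probability(acc, 200):.4f}, "
        f"below 50% after {tokens_until_half(acc)} tokens"
    )
```

The decay depends heavily on the per-token figure: at 0.999 the break-even length is in the hundreds of tokens, while at 0.90 it is in single digits.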
Perplexity: the model’s own uncertainty score
Perplexity is the standard metric for how “surprised” a model is by a sequence. It is the exponentiated cross-entropy and it maps directly onto how probable the model considers the text to be.
# Perplexity of a sequence
# PPL = exp(-1/N · Σ log P(wᵢ | w₁,...,wᵢ₋₁))
def perplexity(probs):
    """probs: list of per-token probabilities assigned by model"""
    N = len(probs)
    log_sum = sum(np.log(p) for p in probs)
    return np.exp(-log_sum / N)
# A confident model on familiar text:
perplexity([0.92, 0.88, 0.91, 0.87, 0.90])  # → ~1.12 (very low)
# Model generating a fabricated citation:
perplexity([0.71, 0.43, 0.38, 0.52, 0.29])  # → ~2.25 (higher, but not dramatic)
# The crucial insight: hallucinated text often has LOW perplexity.
# The model generates fabrications smoothly, confidently, without hesitation,
# because they fit the statistical patterns of training data perfectly,
# even if no such fact exists in reality.
This last point deserves to be said slowly. The most dangerous hallucinations are the ones the model is most confident about. A model that fabricates a plausible-sounding paper title in the style of an academic citation can assign that fabrication a lower perplexity than an awkward but accurate paraphrase of reality. Statistical fluency and factual accuracy are not just different things; they are sometimes in direct tension.
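A toy comparison makes the tension concrete. The per-token probabilities below are invented for illustration, not measured from any model: one sequence stands in for a fluent fabricated citation, the other for an awkward but accurate paraphrase. The hypothetical helper works from log-probabilities, since per-token scores are usually reported in that form.

```python
import math

def perplexity_from_logprobs(logprobs):
    """Perplexity = exp of the negative mean natural-log probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Invented per-token probabilities, purely illustrative.
fluent_fabrication = [0.85, 0.80, 0.90, 0.75, 0.88]
awkward_but_true = [0.60, 0.35, 0.50, 0.40, 0.55]

ppl_fabricated = perplexity_from_logprobs([math.log(p) for p in fluent_fabrication])
ppl_true = perplexity_from_logprobs([math.log(p) for p in awkward_but_true])

print(f"fluent fabrication: perplexity {ppl_fabricated:.2f}")
print(f"awkward but true:   perplexity {ppl_true:.2f}")

# Ranking by perplexity alone would prefer the fabricated text here.
print("lower perplexity:", "fabrication" if ppl_fabricated < ppl_true else "truth")
```

Nothing in the metric refers to the world. It scores how expected the tokens are, so any filter built on perplexity alone inherits that blind spot.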
Temperature scaling: the knob nobody explains honestly
Modern models use temperature scaling to control the shape of the output distribution. This is presented to users as a creativity dial: “Low temperature for factual tasks, high temperature for creative ones.” That framing is true but incomplete, and the part that gets omitted is revealing. The LinkedIn experts get really upset when I point it out.
# Temperature scaling: reshape the logit distribution
# P_T(token_i) = softmax(z / T)_i
#
# T < 1.0 → sharper (model becomes more deterministic)
# T = 1.0 → standard distribution
# T > 1.0 → flatter (more creative / more hallucinatory)
# Nucleus (top-p) sampling: only sample from top probability mass
def top_p_sample(probs, p=0.9):
    probs = np.asarray(probs, dtype=float)
    sorted_idx = np.argsort(probs)[::-1]  # token indices, most likely first
    sorted_probs = probs[sorted_idx]
    cumsum = np.cumsum(sorted_probs)
    # Keep the smallest prefix whose mass reaches p (always at least one token)
    cutoff = np.searchsorted(cumsum, p) + 1
    nucleus = sorted_idx[:cutoff]
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    return np.random.choice(nucleus, p=nucleus_probs)
What temperature actually controls is the hallucination rate. At temperature 0.1, the model almost always picks the top token. At temperature 1.5, the fifth or sixth most likely token wins far more often. The creative outputs users get at high temperature come at a direct cost: the probability mass is no longer concentrated on factually accurate tokens.
The API defaults used by most applications, temperatures around 0.7 to 1.0 and top-p around 0.9, are not chosen because they minimise errors. They are chosen because they produce text that reads well. Fluent text, as it turns out, is not the same as accurate text.
The industry has largely optimised for the former while marketing the latter.
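The effect of the temperature knob is easy to check numerically. The sketch below applies temperature to the five example logits from the softmax section and reports how much probability the top token keeps at each setting. `softmax_with_temperature` is an illustrative helper, not any particular library's API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.2, 2.1, 1.8, 0.5, -0.3]

for t in (0.1, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    rounded = [round(p, 3) for p in probs]
    print(f"T={t}: top token {probs[0]:.3f}, all {rounded}")
```

On these logits the top token's share falls steadily as the temperature rises, and the mass it gives up is redistributed to the lower-ranked tokens.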
What this actually means and what won’t fix it
The standard narrative is that hallucinations are a bug on the path to a fixed product. Train on more data. Scale the model. Apply RLHF. Add retrieval. The problem will shrink until it’s negligible.
This narrative is wrong in a specific and important way.
More training data sharpens the distribution on things that are well-represented in the training corpus. It does nothing for the obscure facts, the niche domains, the things that happened after the cutoff date, the things that were never written down at all. The distribution will always have a long tail. The chain rule will always turn small per-token error rates into large sentence-level error rates. Perplexity and accuracy will always be different things.
What can change, and what the serious research is actually about, is whether models communicate their uncertainty honestly. Calibration research asks: when a model says it is 90% confident, is it right 90% of the time? For most current models the answer is no. They are systematically overconfident. RLHF, which trains models on human preference, makes this worse, because humans prefer confident-sounding answers to hedged ones, and the training signal rewards confidence regardless of accuracy.
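One common way to put a number on that question is expected calibration error (ECE): bin predictions by stated confidence, then compare each bin's average confidence with its observed accuracy. The sketch below is a minimal equal-width-bin version. The `confidences` and `correct` values are made-up data for illustration, and the choice of 10 bins is arbitrary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1]; correct: matching list of booleans.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence to an equal-width bin; 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece

# Made-up example: stated confidence of 0.9, but right only 6 times out of 10.
confidences = [0.9] * 10
correct = [True] * 6 + [False] * 4

print(f"ECE = {expected_calibration_error(confidences, correct):.2f}")
```

In this toy case the gap is 0.30: a stated confidence of 0.9 against an observed accuracy of 0.6. A perfectly calibrated model would score 0.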
Retrieval-augmented generation helps, but it relocates the problem rather than solving it. The model still has to decide when to retrieve, what to retrieve, and whether the retrieved content is relevant. Each of those decisions is a probabilistic one.
The honest version of where we are: these systems will always produce statistically likely outputs, and statistically likely outputs will sometimes be wrong. The question is whether the output signals when it might be wrong, and right now, for the most part, it doesn’t. The model has access to its own entropy at every token. It knows, in a mathematical sense, when it is uncertain. It is almost never trained to tell you.
That gap, between what the model knows about its own uncertainty and what it communicates to you, is not a technical limitation. It is a product decision. And it is the decision that makes hallucinations genuinely dangerous, rather than merely occasionally inconvenient.
The word “hallucination” was always a category error. A hallucination implies a mind that has temporarily lost contact with reality.
What’s actually happening is simpler and more tractable: a probability distribution being sampled in a region where it isn’t reliable, by a system that was never properly incentivised to say so.
Rename it and the problem becomes clearer. It’s not a hallucination. It’s an uncalibrated sample from a high-entropy distribution, presented without the uncertainty estimate it was always carrying.
Fix the incentive. Require the uncertainty estimate. The rest is arithmetic.

