A digital twin is just a model. And your model is wrong.
The digital twin market is projected to reach $110 billion by 2028. It is also, in large part, a rebranding exercise for technology that already existed, applied to problems it cannot actually solve.
The pitch is seductive. Build a perfect digital replica of a physical system. Feed it live sensor data. Watch it mirror reality in real time. Use it to predict failures before they happen, optimise operations without touching the physical asset, run simulations of future states and make decisions accordingly.
This pitch has been made, with minor variations, to manufacturing firms, hospital systems, city planners, energy companies, and militaries. The market analysts have assigned it a number with nine zeroes. The consulting firms have built practices around it. The vendors have given it a name that implies science fiction levels of fidelity: a twin. Not a model, not a simulation, not a dashboard. A twin. A perfect double.
There is a word for what most deployed digital twins actually are. That word is model. And models, as any engineer who has spent time with them knows, are wrong. The question is not whether your model is wrong. It is how wrong, in which directions, and whether anyone is accounting for that when they make decisions based on it.
The digital twin industry has largely decided not to engage with this question. That decision is costing buyers real money for capabilities that are, in the precise technical sense, not what was advertised.
What a digital twin actually is
Ask ten vendors what a digital twin is and you will get ten answers. This is not because the concept is subtle. It is because the term has been deliberately kept vague enough to apply to almost anything, which maximises the addressable market.
The original definition, from Grieves at the University of Michigan in 2002, was specific: a digital twin is a virtual representation of a physical object or system, connected to that object via data, updated continuously, and used to understand and predict the behaviour of the physical counterpart. The connection and the continuity were load-bearing. Without them, you have a model. With them, you have something potentially more useful.
What the market has since defined as a digital twin includes static 3D CAD models with a live dashboard bolted on; BIM files for buildings that were last updated at construction; simulation models that run periodically, not continuously; and dashboards that aggregate sensor data into visualisations without any underlying physics model at all. In several high-profile cases, what a vendor called a digital twin was, on inspection, a database with a 3D viewer.
The definitional collapse is not accidental. Once you allow “digital representation of a physical thing” to count as a twin, the installed base explodes overnight. Every CAD file ever created becomes a digital twin. Every building management system in every office building becomes a digital twin. The market is large because the definition has been made large to fit it.
The actual thing, a continuously updated, data-connected representation that meaningfully tracks a physical system, is much rarer, much harder, and much more expensive than the market narrative suggests.
The model drift problem
A digital twin is only useful if it reflects the current state of its physical counterpart. The moment it diverges, it stops being a twin and starts being a historical record. Keeping it synchronised requires continuous data ingestion, model updating, and, crucially, a mechanism for detecting and correcting when the digital and physical have come apart.
This is a state estimation problem, and the Kalman filter is the standard tool for it: it maintains a running estimate of a system’s state given noisy measurements. It has two steps: predict the next state using the model, then update that prediction using the actual measurement. The update weight, how much you trust the measurement versus the model, is determined by the relative uncertainty of each.
import numpy as np


def kalman_update(x_est, P_est, measurement, H, R, F, Q):
    """
    One step of a Kalman filter.

    x_est: current state estimate (n,)
    P_est: current state covariance (n, n)
    measurement: observed sensor reading (m,)
    H: observation matrix (m, n) maps state to measurement space
    R: measurement noise covariance (m, m) how much we trust sensors
    F: state transition matrix (n, n) the physics model
    Q: process noise covariance (n, n) how much the model drifts

    The critical ratio: Q / R
    High Q, low R: trust the sensors, distrust the model (model is bad)
    Low Q, high R: trust the model, distrust the sensors (sensors are bad)
    """
    # Predict step: where does the model say we'll be?
    x_pred = F @ x_est
    P_pred = F @ P_est @ F.T + Q

    # Kalman gain: how much to correct toward the measurement
    S = H @ P_pred @ H.T + R  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)

    # Update step: correct the prediction with the measurement
    innovation = measurement - H @ x_pred  # how wrong was the model?
    x_updated = x_pred + K @ innovation
    P_updated = (np.eye(len(x_est)) - K @ H) @ P_pred

    return x_updated, P_updated, innovation
# Concrete example: tracking a turbine bearing temperature
# State: [temperature, rate_of_change]
# Measurement: raw thermocouple reading
n_steps = 100
np.random.seed(42)

# True system: temperature rising slowly with noise
true_temp = 80.0
true_state = np.array([true_temp, 0.1])  # [temp, rate]

# Model parameters (what the digital twin thinks the physics is)
F = np.array([[1.0, 1.0],   # temp += rate * dt
              [0.0, 1.0]])  # rate stays constant (simplified)
H = np.array([[1.0, 0.0]])  # we only measure temperature, not rate

# The critical question: how wrong is the model?
Q_good = np.diag([0.1, 0.01])  # model is trusted: low process noise
Q_bad = np.diag([2.0, 0.5])    # model is not trusted: high process noise
R = np.array([[1.5]])          # sensor noise covariance (thermocouples are noisy)

# Run both scenarios. Each filter keeps its own covariance: sharing one P
# would let each filter's uncertainty leak into the other.
x_est_good = np.array([80.0, 0.0])
x_est_bad = np.array([80.0, 0.0])
P_good = np.eye(2)
P_bad = np.eye(2)
drift_good, drift_bad = [], []

for step in range(n_steps):
    # Simulate physical system drifting from model assumptions
    true_state[1] += 0.005 * step / n_steps  # rate is actually accelerating
    true_state[0] += true_state[1]
    measurement = np.array([true_state[0] + np.random.randn() * 1.2])

    x_est_good, P_good, innov_good = kalman_update(
        x_est_good, P_good, measurement, H, R, F, Q_good)
    x_est_bad, P_bad, innov_bad = kalman_update(
        x_est_bad, P_bad, measurement, H, R, F, Q_bad)

    drift_good.append(abs(x_est_good[0] - true_state[0]))
    drift_bad.append(abs(x_est_bad[0] - true_state[0]))

mean_good = np.mean(drift_good)
mean_bad = np.mean(drift_bad)
print(f"After {n_steps} steps:")
print(f"  Twin with trusted (but wrong) model -- mean drift: {mean_good:.2f} deg C")
print(f"  Twin with distrusted model -- mean drift: {mean_bad:.2f} deg C")
print()
print(f"  Drift ratio (trusted / distrusted): {mean_good / mean_bad:.1f}x")
print("  This is the core problem: vendors tune Q to make demos look good.")
print("  In production, the model is always wrong in ways the demo didn't include.")
# The figures depend on the seed, on Q and R, and on how far the true dynamics
# depart from F. A ratio above 1 means the trusting twin tracked worse. The
# penalty for over-trusting the model grows as the unmodelled acceleration
# grows relative to Q; when that acceleration is small, the trusting twin can
# come out ahead because it filters out more sensor noise.

The Q matrix is where digital twin projects go to die. Q encodes how much you trust your own physics model. Set it too low, trusting the model too much, and the twin lags behind when the physical system drifts from the model’s assumptions, then eventually diverges. Set it too high, distrusting the model, and you are essentially just filtering the sensor data, with no need for the physics model at all.
Most vendors set Q in their demos by tuning it to match their demo dataset. In production, the physical system behaves differently than the demo. The model is wrong in new directions. The twin drifts. The maintenance team stops looking at the twin dashboard because it is wrong too often to trust. This is not a hypothetical. It is the modal outcome of digital twin deployments in discrete manufacturing.
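One way to notice that Q has been set too low is to watch the filter's own innovations. This is a sketch of a standard consistency check, not a description of what any vendor ships. For a linear filter with correctly specified noise, the normalised innovation squared (NIS), the innovation weighted by the inverse of the innovation covariance S from the Kalman step, averages roughly m, the measurement dimension. A sustained average well above m suggests the filter is more confident than its model deserves. The function names and the alarm factor of 2.0 are assumptions made for illustration.

```python
import numpy as np

def normalized_innovation_squared(innovation, S):
    """NIS for one filter step: innovation^T S^-1 innovation."""
    innovation = np.atleast_1d(innovation)
    return float(innovation @ np.linalg.solve(S, innovation))

def is_overconfident(nis_values, m, factor=2.0):
    """Flag a filter whose mean NIS exceeds `factor` times its expected value m.

    `factor` is an assumed tuning knob, not a standard constant.
    """
    return float(np.mean(nis_values)) > factor * m

# Illustrative scalar case: the filter believes the innovation variance is 1.0.
rng = np.random.default_rng(0)
S = np.array([[1.0]])

# Innovations that match the filter's belief (std 1.0) ...
consistent = [normalized_innovation_squared(rng.normal(0.0, 1.0, 1), S)
              for _ in range(500)]
# ... and innovations three times larger than it believes (std 3.0).
drifted = [normalized_innovation_squared(rng.normal(0.0, 3.0, 1), S)
           for _ in range(500)]

print("consistent filter flagged:", is_overconfident(consistent, m=1))
print("drifted filter flagged:   ", is_overconfident(drifted, m=1))
```

To use this with a Kalman step like the one above, the step would also need to return S alongside the innovation. The check does not say what is wrong with the model, only that the filter's stated uncertainty no longer matches what the sensors report.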
The sensor data problem nobody prices in
The Kalman filter above assumes you have measurements. Good measurements, arriving on time, from sensors that are working. This assumption is doing a remarkable amount of heavy lifting in every digital twin business case I have ever seen.
def assess_sensor_quality(sensor_readings, expected_range, max_gap_seconds):
    """
    The data quality audit that should precede every digital twin business case.
    Almost never actually done before signing the contract.

    sensor_readings: sequence of (timestamp_seconds, value) pairs, sorted by time
    expected_range: (low, high) bounds of physically plausible values
    max_gap_seconds: longest acceptable interval between consecutive readings
    """
    readings = np.array(sensor_readings, dtype=float)
    timestamps = readings[:, 0]
    values = readings[:, 1]

    # 1. Completeness: what fraction of expected readings actually arrived?
    gaps = np.diff(timestamps)
    # An over-long gap of g seconds hides roughly g / max_gap_seconds - 1 readings
    missing_count = (gaps[gaps > max_gap_seconds] / max_gap_seconds - 1).sum()
    completeness = len(gaps) / (len(gaps) + missing_count)

    # 2. Range validity: sensors fail in characteristic ways
    in_range = (values >= expected_range[0]) & (values <= expected_range[1])
    range_valid = in_range.mean()

    # 3. Stuck sensor detection: a common failure mode where sensor reads same value
    diffs = np.diff(values)
    stuck_windows = (np.abs(diffs) < 0.001).sum()
    stuck_pct = stuck_windows / len(diffs)

    # 4. Update interval: is the data arriving fast enough to be useful?
    # For a fast-moving system, 30-second old data is not "real time".
    # This is the spacing between readings, not end-to-end latency.
    median_interval = np.median(gaps)

    print("Sensor data quality report:")
    print(f"  Completeness: {completeness:.1%}")
    print(f"  Range validity: {range_valid:.1%}")
    print(f"  Stuck sensor windows: {stuck_pct:.1%} of readings")
    print(f"  Median update interval: {median_interval:.1f}s")
    print()

    if completeness < 0.95:
        print(f"  WARNING: {(1 - completeness):.1%} of readings missing.")
        print("  The twin will interpolate or hold-last-value.")
        print("  Neither is the physical system. It is a guess.")

    if stuck_pct > 0.02:
        print(f"  WARNING: {stuck_pct:.1%} of readings look stuck.")
        print("  Stuck sensors read healthy while the system degrades.")
        print("  The twin will show green. The asset is not green.")

    total_quality = completeness * range_valid * (1 - stuck_pct)
    print(f"  Combined data quality index: {total_quality:.1%}")
    print(f"  Your twin is {total_quality:.1%} of a twin. The rest is fiction.")
    return total_quality

Real industrial sensor estates are not what the demo shows. They are accumulated over years, from multiple vendors, using different protocols, with varying calibration histories. Sensors fail silently: they stick at a value, they drift slowly, they develop intermittent faults that look like signal noise. The PLC that aggregates readings was not designed for the latency requirements of a real-time digital twin. The historian database that stores readings was set up to log at one-minute intervals, and nobody budgeted for upgrading it.
A 2017 Cisco survey found that only 26% of companies had an IoT initiative they considered a complete success, and that 60% of initiatives stalled at the proof-of-concept stage. In industrial deployments the recurring cause is not the software. It is data: missing data, bad data, data that arrives too slowly, data from sensors that have not been calibrated in three years. The digital twin sits downstream of all of this. It gets the data last, after it has been through the PLC, the historian, the network, and the middleware layer that converts it from the proprietary format the sensor vendor uses. By the time the twin sees a reading, it can be seconds to minutes old.
For a slow-moving system like a building, this is fine. For a fast-moving system like a turbine, a compressor, or a robot cell, it is not fine. “Real time” in most digital twin deployments means “real time relative to the historian update frequency,” which is not the same thing as real time.
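Whether a given delay is "fine" can be estimated on the back of an envelope: the state can move by at most its maximum rate of change multiplied by the age of the reading. The sketch below does that arithmetic. The rates, data age, and tolerance are invented illustration values, not measurements from any real asset, and the function names are hypothetical.

```python
def staleness_error(max_rate_per_s, data_age_s):
    """Upper bound on how far a signal can move while a reading is in transit."""
    return max_rate_per_s * data_age_s

def is_fresh_enough(max_rate_per_s, data_age_s, tolerance):
    """True if the worst-case movement during the delay stays within tolerance."""
    return staleness_error(max_rate_per_s, data_age_s) <= tolerance

# Hypothetical figures: readings are 60 seconds old by the time the twin sees them.
data_age_s = 60.0
assets = {
    "office zone temperature": 0.001,  # assumed max rate, deg C per second
    "turbine bearing temperature": 0.5,  # assumed max rate, deg C per second
}

for name, max_rate in assets.items():
    bound = staleness_error(max_rate, data_age_s)
    ok = is_fresh_enough(max_rate, data_age_s, tolerance=1.0)
    print(f"{name}: up to {bound:.2f} deg C stale, within 1 deg C tolerance: {ok}")
```

The same one-minute historian interval that is harmless for the slow signal leaves the fast one tens of degrees out of date in the worst case, under these assumed rates.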
The model fidelity trap
Suppose you solve the data problem. Sensors are good, data is clean, latency is acceptable. Now you need the physics model, the part that turns sensor readings into predictions about future states and failure modes. This is the part the vendors call “the digital twin” in their marketing materials, because it is the part that justifies the premium pricing.
Models exist on a spectrum from empirical (curve-fit to historical data) to first-principles (derived from governing equations), with hybrids in between. Every point on that spectrum has problems.
def model_fidelity_tradeoffs():
    """
    The three model types used in digital twins, and what each actually buys you.
    """
    models = {
        "Empirical (data-driven)": {
            "build_cost": "Low -- fit a model to historical data",
            "physics_required": "None",
            "extrapolation": "POOR -- fails outside training distribution",
            "calibration": "Automatic, but retraining needed as system ages",
            "failure_modes": "Confidently wrong on novel operating conditions",
            "honest_name": "Glorified regression",
        },
        "Physics-based (first principles)": {
            "build_cost": "Very high -- requires domain experts, months of work",
            "physics_required": "Complete governing equations",
            "extrapolation": "Good within physical assumptions",
            "calibration": "Manual, expensive, degrades as system wears",
            "failure_modes": "Assumes ideal conditions that do not exist in production",
            "honest_name": "Simulation that was calibrated once and then drifted",
        },
        "Hybrid": {
            "build_cost": "Highest -- requires both expertise types",
            "physics_required": "Partial",
            "extrapolation": "Moderate",
            "calibration": "Ongoing, requires both data and engineering review",
            "failure_modes": "Inherits worst-case failure modes of both approaches",
            "honest_name": "What vendors promise; rarely what is delivered",
        },
    }

    for model_type, props in models.items():
        print(f"\n{model_type}")
        print(f"  Honest name: {props['honest_name']}")
        print(f"  Extrapolation: {props['extrapolation']}")
        print(f"  Primary failure: {props['failure_modes']}")

    print("""
    The trap:
    Vendors sell the hybrid vision.
    They deliver empirical models because they are fast and cheap to build.
    Empirical models fail outside their training distribution.
    Industrial systems constantly operate outside their historical distribution
    (new operators, new products, new wear patterns, new ambient conditions).
    The twin fails precisely when you need it most: novel conditions.
    """)


model_fidelity_tradeoffs()

The first-principles physics model has its own problem: it was calibrated on a new machine, or a reference machine in a laboratory. The machine you actually have is not new. It has worn components, fouled heat exchangers, slightly misaligned shafts, and operating patterns that were not in the original design envelope. The physics model describes a platonic version of your asset. Your asset is not that.
Calibrating the model to your actual asset is a significant piece of engineering work, and it degrades over time as the asset changes. Most digital twin projects include an initial calibration. Almost none include a systematic recalibration process. Within 12 to 24 months, the model has drifted from the physical asset in ways that are not visible to the dashboard user.
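A systematic recalibration process needs a trigger, something that notices the model sliding away from the asset before the dashboard user does. One common option, sketched here under assumptions and not taken from any deployment described in this piece, is a two-sided CUSUM on the residual between the twin's prediction and the measurement. The `slack` and `threshold` values are assumed tuning parameters, and the residual series is synthetic.

```python
def cusum_first_alarm(residuals, slack, threshold):
    """Return the index of the first sample where accumulated bias exceeds threshold.

    residuals: prediction minus measurement, one value per time step
    slack: residual magnitude treated as normal noise and ignored
    threshold: accumulated excess that triggers a recalibration alarm
    Returns None if no alarm is raised.
    """
    pos = 0.0  # accumulated evidence that the model reads too high
    neg = 0.0  # accumulated evidence that the model reads too low
    for i, r in enumerate(residuals):
        pos = max(0.0, pos + r - slack)
        neg = max(0.0, neg - r - slack)
        if pos > threshold or neg > threshold:
            return i
    return None

# Synthetic residuals: the model is exact for 50 steps, then a slow bias
# creeps in at 0.02 units per step (for example, gradual fouling or wear).
healthy = [0.0] * 50
creeping_bias = [0.02 * k for k in range(1, 101)]
residuals = healthy + creeping_bias

alarm = cusum_first_alarm(residuals, slack=0.5, threshold=5.0)
print("first alarm at step:", alarm)
print("alarm on a model with no bias:", cusum_first_alarm(healthy, 0.5, 5.0))
```

The design choice is that small residuals inside the slack band never accumulate, while a persistent bias does, even when each individual residual looks unremarkable. That is the pattern a slowly degrading asset produces.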
This is the failure mode that never appears in case studies. Nobody publishes a paper titled “Our digital twin was accurate for 18 months and then quietly became useless.” The vendors move on to the next sale. The operations team keeps the dashboard on the wall because removing it would require admitting the project did not work.
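The extrapolation weakness of a purely empirical model is easy to reproduce in a toy setting. In the sketch below the "physics" is an invented quadratic heating law and the empirical model is a straight line fitted to a narrow band of historical operating speeds. Both are assumptions chosen for illustration and neither represents a real asset.

```python
import numpy as np

def true_temperature_rise(speed):
    """Hypothetical ground truth: heating grows with the square of speed."""
    return 0.002 * speed ** 2

# Historical data only covers speeds from 100 to 200 (arbitrary units).
train_speed = np.linspace(100.0, 200.0, 50)
slope, intercept = np.polyfit(train_speed, true_temperature_rise(train_speed), deg=1)

def empirical_model(speed):
    """Straight-line fit to the historical operating range."""
    return slope * speed + intercept

def abs_error(speed):
    return abs(empirical_model(speed) - true_temperature_rise(speed))

# Worst error inside the range the model was trained on ...
in_range_error = max(abs_error(s) for s in train_speed)
# ... versus the error at a novel operating point at twice the highest trained speed.
novel_error = abs_error(400.0)

print(f"worst error inside training range: {in_range_error:.1f}")
print(f"error at novel operating point:    {novel_error:.1f}")
```

Inside the training range the fit looks excellent, which is exactly what a pilot evaluated on historical data would report. The error only appears once the asset is run somewhere the data never went.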
What “predictive maintenance” actually predicts
The headline use case for digital twins in industrial settings is predictive maintenance: catch failures before they happen, reduce unplanned downtime, extend asset life. This is real. Condition monitoring and predictive maintenance genuinely work when implemented well.
The question is whether you need a digital twin to do it.
import numpy as np


def predictive_maintenance_reality_check(
    sensor_history,
    failure_events,
    twin_predictions,
    threshold=0.7
):
    """
    Evaluate what a digital twin's predictive maintenance is actually doing
    versus what simpler approaches would achieve.

    All three inputs are 1-D sequences of the same length, one entry per period.
    """
    sensor_history = np.array(sensor_history)
    failure_events = np.array(failure_events)      # 1 = failure in next 30 days
    twin_predictions = np.array(twin_predictions)  # model's failure probability

    # What the digital twin vendor reports: precision and recall at threshold
    predicted_positive = twin_predictions >= threshold
    true_positive = (predicted_positive & (failure_events == 1)).sum()
    false_positive = (predicted_positive & (failure_events == 0)).sum()
    false_negative = (~predicted_positive & (failure_events == 1)).sum()
    # The 1e-9 terms guard against division by zero when nothing is flagged
    twin_precision = true_positive / (true_positive + false_positive + 1e-9)
    twin_recall = true_positive / (true_positive + false_negative + 1e-9)

    # What a simple statistical rule gets you: flag when reading exceeds 2 std devs
    mu, sigma = sensor_history.mean(), sensor_history.std()
    simple_flag = sensor_history > (mu + 2 * sigma)
    tp_simple = (simple_flag & (failure_events == 1)).sum()
    fp_simple = (simple_flag & (failure_events == 0)).sum()
    fn_simple = (~simple_flag & (failure_events == 1)).sum()
    simple_precision = tp_simple / (tp_simple + fp_simple + 1e-9)
    simple_recall = tp_simple / (tp_simple + fn_simple + 1e-9)

    print("Predictive maintenance comparison:")
    print(f"{'Method':<30} {'Precision':>10} {'Recall':>10}")
    print("-" * 52)
    print(f"{'Digital twin':<30} {twin_precision:>10.1%} {twin_recall:>10.1%}")
    print(f"{'Simple statistical rule':<30} {simple_precision:>10.1%} {simple_recall:>10.1%}")
    print()
    print("  The question to ask your vendor:")
    print("  What is the marginal improvement over a simple threshold rule?")
    print("  And does that margin justify the implementation and maintenance cost?")
    print("  In most published evaluations, the margin is single-digit percentage points.")
    print("  The cost difference is an order of magnitude.")

The published literature on digital twin predictive maintenance is full of precision and recall numbers that look impressive in isolation. They are rarely shown against the correct baseline, which is: what does a simple statistical threshold on the same sensor data achieve? In most cases, the answer is within a few percentage points of the twin’s performance. In some cases, the simple rule beats it.
This matters because the simple rule costs almost nothing to implement and maintain. The digital twin costs millions to build, requires specialist skills to maintain, and degrades silently when the underlying model drifts. The ROI case, which almost always computes the benefit of predictive maintenance against the cost of unplanned downtime, uses the twin’s performance numbers in the numerator. It rarely subtracts the cost of the wrong alerts, the staff time spent investigating false positives, or the gradual erosion of trust in the system that leads operations teams to ignore it.
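The missing subtraction can be written down in a few lines. The sketch below is an arithmetic illustration only: every figure in it is a hypothetical placeholder chosen to show how the calculation works, not evidence about any real deployment. It also assumes, generously, that every correctly predicted failure avoids the full downtime cost.

```python
def net_annual_benefit(failures_per_year, recall, precision,
                       downtime_cost, investigation_cost, annual_system_cost):
    """Net yearly value of an alerting system once false alarms and upkeep are counted."""
    caught = failures_per_year * recall
    # precision = TP / (TP + FP), so FP = TP * (1 - precision) / precision
    false_alarms = caught * (1 - precision) / precision
    return (caught * downtime_cost
            - false_alarms * investigation_cost
            - annual_system_cost)

# Hypothetical plant: 10 failures a year, 50,000 per unplanned outage,
# 5,000 of staff time to investigate each false alarm.
twin_net = net_annual_benefit(
    failures_per_year=10, recall=0.80, precision=0.50,
    downtime_cost=50_000, investigation_cost=5_000,
    annual_system_cost=300_000,  # assumed licence, integration, and specialist upkeep
)
rule_net = net_annual_benefit(
    failures_per_year=10, recall=0.75, precision=0.40,
    downtime_cost=50_000, investigation_cost=5_000,
    annual_system_cost=10_000,  # assumed cost of maintaining a threshold rule
)

print(f"digital twin net benefit: {twin_net:,.0f}")
print(f"simple rule net benefit:  {rule_net:,.0f}")
```

With these invented inputs the better-performing system is the worse investment, because a few points of recall do not cover the difference in running cost. Real inputs may tell a different story, which is the reason to run the calculation with them.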
The coordination problem nobody mentions
There is a failure mode that is not technical at all, and it is the most common reason digital twin projects deliver nothing.
A digital twin produces predictions. Those predictions are only valuable if someone acts on them. Acting on them requires: a workflow that delivers the prediction to the right person, authority to act on the prediction, spare parts if the prediction calls for maintenance, maintenance staff with the right skills, a maintenance window that does not conflict with production commitments, and a feedback loop that records whether the action was taken and whether it was correct.
None of this is in scope for the typical digital twin implementation. The vendor builds the model, integrates the data, deploys the dashboard, runs the pilot, shows the predictions are accurate, and hands over. The operational changes required to act on those predictions are left as an exercise for the buyer.
The result is a dashboard that accurately predicts failures that nobody has the workflow to prevent. The twin is technically working. The benefit is zero.
This is not a criticism of any specific vendor. It is a structural feature of how the market is organised. The digital twin is a software product. The operational transformation required to extract value from it is a change management programme. They are priced and sold separately, delivered by different teams, and the connection between them is treated as the buyer’s problem.
What you should actually buy
None of this means that the underlying technologies are useless. Sensor networks, condition monitoring, physics simulation, and state estimation are all valuable. They have been valuable for decades, deployed under less glamorous names: SCADA, historian systems, FEA models, Kalman filters, process control. They work when implemented with realistic expectations about what they can and cannot do.
The question to ask any digital twin vendor is not “can you build a twin of my asset?” The answer to that question is always yes, because the definition of twin has been made expansive enough to guarantee it. The questions to ask are:
What is your model type, and how does it perform outside the training distribution?
What is your assumed sensor data quality, and what happens when the actual data quality is lower?
What is the recalibration process when the model drifts, and who pays for it?
What operational workflow changes are required to act on the twin’s outputs, and are those in scope?
What is the performance of your system against a simple statistical baseline on the same data?
If the vendor cannot answer all five questions with specifics, you are not buying a digital twin.
You are buying a 3D dashboard with aspirational pricing.
The technology is not the problem. The framing is the problem. “Twin” implies a fidelity that the physics of model drift, sensor noise, and data latency make impossible to sustain. A model that starts accurate and degrades silently is not a twin. It is a liability, because it creates the conditions for confident decisions made on stale information.
The industry built a business on a word. The word implies a guarantee the physics will not allow. And the gap between the word and the reality is, reliably and consistently, your problem.

