MMM Calibration: What the Industry Won't Tell You About Experiment Quality
The Short Version
We tested whether adding incrementality experiment results to Marketing Mix Models improves budget decisions. Here's what we found across 327 datasets and 1,241 models:
Good experiments make your model significantly more accurate — channel attribution improves 6-16%, and budget recommendations get meaningfully closer to optimal.
Bad experiments make it worse than having no experiments at all — accuracy degrades 10-40% because the model trusts wrong measurements.
The fix is a quality filter — using only the single best experiment per channel (the "Single Best" strategy) captures nearly all the upside while filtering out the noise.
We built a scoring framework (EQS) to help you decide which experiments are safe to use for calibration and which ones to throw out.
The conclusion: calibration is essential for decision-grade accuracy. But quality filtering is just as essential.
Why This Matters: Your MMM Might Misallocate Your Capital
Your Marketing Mix Model is only useful if it tells you where to spend your next dollar. If it's wrong by 20%, that's not just a rounding error — it's a budget misallocation.
On a $10M annual ad budget, a 20% attribution error means $2M is going to the wrong channels every year. That's money spent on channels that aren't driving the returns your model says they are.
Most MMMs learn from historical correlations between spend and revenue. The problem: channels that happen to run during high-revenue periods (holidays, product launches) get credited for revenue they didn't drive. Channels with steady-state budgets get undercredited for revenue they did drive.
Incrementality experiments measure what actually changes when you increase or decrease spend. They're the ground truth your model needs.
Calibration bridges the gap: it uses experiment results to force your model to align with measured reality, not just correlations.
Think of it this way: an uncalibrated MMM is GPS without satellite correction. It gets you in the right neighborhood, but it might put you on the wrong street. Calibration is the satellite signal that snaps it to the right address.
But here's the catch: not all experiments are reliable. So we asked: what happens when you calibrate with experiments that got the wrong answer?
What We Tested
We ran the largest controlled simulation of MMM calibration to date. Here's the setup:
Starting point: Real marketing data from two e-commerce businesses — a fashion brand (5 channels, Meta-heavy) and a home goods brand (6 channels, balanced spend).
Scale: 327 synthetic datasets with KNOWN GROUND TRUTH, so we could measure exactly how accurate each model was.
Models trained: 1,241 models across three scenarios.
The three scenarios:
| Scenario | What It Tests | Why It Matters |
|---|---|---|
| No Calibration (Baseline) | Model learns only from historical data | This is what most teams are doing today |
| Good Experiments | Model is calibrated with experiments that correctly measured channel impact | Best case: what happens when your experiments are right? |
| Bad Experiments | Model is calibrated with experiments that got the wrong answer | Worst case: what happens when your experiments are wrong? |
Because we built the synthetic data, we knew the true contribution of each channel (something impossible to know in the real world), which let us measure exactly how much each scenario helped or hurt.
The Results: What Changes When You Calibrate
With Good Experiments: Everything Gets Better
Compared to an uncalibrated baseline, models calibrated with high-quality experiments delivered:
Channel attribution accuracy improved 6-16% — meaning your budget recommendations get meaningfully closer to optimal.
Prediction accuracy improved ~44% — the model becomes substantially better at forecasting revenue from different spend levels.
Model stability improved ~53% — less variance between model runs, so you can trust the results more.
Every single accuracy metric improved. No exceptions.
What this means in practice: For every $100 your uncalibrated model says to spend on a channel, the true optimum might be anywhere from $84 to $116. With good calibration, that range tightens to $92-$108. Over a $10M budget, that's the difference between $1.6M misallocated and $800K misallocated.
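To make the arithmetic concrete, here's a minimal sketch in Python. The $10M budget and the ±16%/±8% error bands are the illustrative figures from the example above:

```python
# Rough misallocation math, using the illustrative figures from the text.
annual_budget = 10_000_000  # the $10M example budget

def misallocated(budget: float, attribution_error: float) -> float:
    """Dollars steered to the wrong channels at a given attribution error."""
    return budget * attribution_error

uncalibrated = misallocated(annual_budget, 0.16)  # +/-16% band: $84-$116 per $100
calibrated = misallocated(annual_budget, 0.08)    # +/-8% band: $92-$108 per $100

print(f"Uncalibrated: ${uncalibrated:,.0f} misallocated per year")
print(f"Calibrated:   ${calibrated:,.0f} misallocated per year")
print(f"Difference:   ${uncalibrated - calibrated:,.0f}")
```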
With Bad Experiments: It Gets Worse Than No Experiments
Here's the critical finding: calibrating with bad experiments doesn't just fail to help; it actively makes models worse than having no calibration inputs at all.
| What We Measured | No Calibration | Good Experiments | Bad Experiments |
|---|---|---|---|
| Channel attribution accuracy | Baseline | Better (+6% to +16%) | Worse (-10% to -40%) |
| Prediction accuracy | Baseline | Better (+44%) | Worse (-20%) |
| Model stability | Baseline | Better (+53%) | Worse (-5% to -92%) |
| Budget recommendation quality | Baseline | Significantly better | Significantly worse |
The mechanism is intuitive: when you tell your model to trust incorrect measurements, it overrides what it would have learned from the data on its own. This gives you a model confidently pointing you to the wrong conclusions.
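One way to picture this: many MMMs implement calibration as an extra penalty that pulls a channel's coefficient toward the experiment's measured lift. The toy example below is a stylized sketch of that idea, not any specific vendor's implementation; the synthetic data, grid search, and penalty weight are all assumptions for illustration:

```python
import numpy as np

# Stylized view of calibration: the model trades off fitting history
# against matching the experiment's measured lift. Data, grid search,
# and penalty weight are all illustrative assumptions.
rng = np.random.default_rng(0)
true_beta = 2.0                      # true incremental return of the channel
spend = rng.uniform(50, 150, 104)    # two years of weekly spend
revenue = true_beta * spend + rng.normal(0, 40, 104)

def fitted_beta(experiment_lift: float, weight: float = 1e4) -> float:
    """Minimize squared fit error + weight * (beta - experiment_lift)^2."""
    betas = np.linspace(0.0, 4.0, 4001)
    fit_err = ((revenue[:, None] - betas * spend[:, None]) ** 2).mean(axis=0)
    penalty = weight * (betas - experiment_lift) ** 2
    return betas[np.argmin(fit_err + penalty)]

print(fitted_beta(2.1))  # good experiment: estimate stays near the true 2.0
print(fitted_beta(0.8))  # bad experiment: estimate is dragged toward 0.8
```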
The Risk: When Calibration Backfires
Why do so many experiments get it wrong? Three common reasons:
Insufficient test duration. Short tests (2-4 weeks) often don't run long enough for the effect to stabilize. The measured result captures noise, not signal.
Contaminated holdout groups. If users in the "no ads" group still see your ads through other channels, the measured lift is understated. Even 5% leakage can throw off results.
Measurement window mismatch. If your experiment measures impact over 7 days but your model attributes impact over 30 days, they're measuring different things. The experiment result looks wrong to the model because it is wrong — for the model's purposes.
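A toy adstock calculation shows how big the mismatch can be. Assuming a geometric adstock with a 0.85 daily decay (an illustrative figure, not from the study), a 7-day readout captures only about two-thirds of the effect a 30-day attribution window would credit:

```python
import numpy as np

# How much of a channel's modeled effect lands inside each readout window,
# assuming a geometric adstock with 0.85 daily decay (illustrative only).
daily_effect = 0.85 ** np.arange(60)  # impulse response of one unit of spend
total = daily_effect.sum()

print(f"7-day window:  {daily_effect[:7].sum() / total:.0%} of the effect")   # ~68%
print(f"30-day window: {daily_effect[:30].sum() / total:.0%} of the effect")  # ~99%
```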
The danger compounds when you use multiple bad experiments. Each one pulls the model in a different wrong direction. Conflicting bad signals create more damage than a single bad signal.
Our data showed that the "use all experiments" approach was actually the riskiest strategy. When experiment quality is mixed, using everything creates conflicting constraints that degrade accuracy by 10-40%.
How to Score Your Experiments: The EQS Framework
The research makes one thing clear: experiment quality filtering isn't optional. You need a systematic way to decide which experiments are safe to use for calibration.
We built the Experiment Quality Score (EQS) — a 100-point framework that evaluates every experiment across four dimensions before it's allowed to influence your model.
Dimension 1: Statistical Uncertainty (0-25 points)
How much uncertainty is there behind the result?
For calibration, precision beats power. Once an experiment is read out (significant or not), the most useful quality signal is how tight and stable the estimate is. Wide intervals mean the "true" effect could be dramatically different from the point estimate — which makes calibration risky.
| Score | Criteria | What This Means |
|---|---|---|
| 25 (Gold) | Tight estimate: confidence interval (CI) is within ±25% of the point estimate (or tighter). Results are stable under small sensitivity checks (no sign flips). | High-precision signal. Safe to use for calibration. |
| 15 (Silver) | Moderate uncertainty: CI is ±25% to ±75% of the point estimate, and sensitivity checks shift the estimate but don't break it. | Directionally useful, but calibration impact should be tempered. |
| 5 (Red) | High uncertainty: CI is wider than ±75% of the point estimate. | Too noisy for calibration. High risk of pulling the model in the wrong direction. |
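As a sketch, the rubric above might be coded like this (thresholds straight from the table; the sensitivity-check condition is left out for brevity):

```python
def score_statistical_uncertainty(ci_low: float, ci_high: float,
                                  point_estimate: float) -> int:
    """Dimension 1: CI half-width relative to the point estimate."""
    relative = ((ci_high - ci_low) / 2) / abs(point_estimate)
    if relative <= 0.25:
        return 25  # Gold: CI within +/-25% of the point estimate
    if relative <= 0.75:
        return 15  # Silver: +/-25% to +/-75%
    return 5       # Red: wider than +/-75%
```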
Dimension 2: Measurement Alignment (0-25 points)
Does the experiment measure the same thing your model measures?
This dimension checks the coherence between the experiment's duration and the lag effect the MMM estimates. Your MMM might attribute revenue for a channel over a 14-day lag effect. If the experiment only lasted 7 days, those numbers are measuring different things, and the mismatch will mislead the model.
The experiment design and its duration need to be coherent with the MMM's insights. If they're not, either the MMM or the experiment is wrong, and you'll need to redo one of them.
| Score | Criteria | What This Means |
|---|---|---|
| 25 (Gold) | Experiment and model windows within ±3 days | Apples-to-apples comparison. The experiment result maps directly to the model. |
| 15 (Silver) | Windows within ±7 days | Close enough to be useful, but some revenue is missed or double-counted. |
| 5 (Red) | Gap > 7 days or unknown windows | The experiment and model are measuring different things. Not safe for calibration. |
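A minimal sketch of this rubric, with the day thresholds from the table:

```python
def score_measurement_alignment(experiment_window_days: int,
                                model_lag_days: int) -> int:
    """Dimension 2: gap between the experiment and model windows."""
    gap = abs(experiment_window_days - model_lag_days)
    if gap <= 3:
        return 25  # Gold: windows within +/-3 days
    if gap <= 7:
        return 15  # Silver: windows within +/-7 days
    return 5       # Red: gap > 7 days (treat unknown windows as Red too)
```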
Dimension 3: Holdout Quality (0-25 points)
Was the control group truly unexposed?
A clean holdout means the "no ads" group genuinely saw no ads from that channel. In practice, cross-channel leakage, organic reach, and audience overlap can contaminate the control group — understating true impact.
| Score | Criteria | What This Means |
|---|---|---|
| 25 (Gold) | Clean isolation, verified no cross-contamination | The control group was truly unexposed. The measured lift is reliable. |
| 15 (Silver) | Minor leakage (< 5%) | Mostly clean. Result is slightly understated but still directionally correct. |
| 5 (Red) | Known contamination or unverified holdout | The control group saw ads. The measured lift is unreliable. |
Also verify that the right amount of budget was actually invested in the channel. Most of the time, implementation errors mean test regions don't change budgets coherently with the experiment design, so check the delivered spend against the plan.
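A sketch of this rubric in code, treating an unmeasured leakage rate as Red per the "unverified holdout" criterion:

```python
def score_holdout_quality(leakage_rate: float | None,
                          verified_clean: bool) -> int:
    """Dimension 3: control-group contamination.

    leakage_rate is the share of the holdout exposed to the channel's ads;
    None means it was never measured, which counts as unverified (Red).
    """
    if leakage_rate is None:
        return 5   # Red: unverified holdout
    if leakage_rate == 0 and verified_clean:
        return 25  # Gold: verified clean isolation
    if leakage_rate < 0.05:
        return 15  # Silver: minor leakage under 5%
    return 5       # Red: known contamination
```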
Dimension 4: Model Alignment (0-25 points)
Does the experiment result match what your model expected?
Before calibrating, check: does the experiment's measured return fall within the model's confidence range for that channel? If the model says Meta returns $1.50-$3.00 per dollar and the experiment says $0.40, something is wrong — either with the experiment, the model, or both.
| Score | Criteria | What This Means |
|---|---|---|
| 25 (Gold) | Experiment result falls within model's confidence interval | The experiment confirms and refines what the model already believes. High-value calibration signal. |
| 15 (Silver) | Result is within 2× the model's confidence interval | Some tension between experiment and model. Use with caution: investigate the gap before calibrating. |
| 5 (Red) | Result is far outside model's confidence interval | Major disagreement. Either the experiment or the model is significantly wrong. Don't calibrate until you understand why. |
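Here's a sketch of this check, reading "within 2× the model's confidence interval" as falling inside the interval widened to twice its width (one reasonable interpretation):

```python
def score_model_alignment(result: float, model_ci_low: float,
                          model_ci_high: float) -> int:
    """Dimension 4: does the experiment land inside the model's CI?"""
    if model_ci_low <= result <= model_ci_high:
        return 25  # Gold: the experiment confirms the model's range
    center = (model_ci_low + model_ci_high) / 2
    half_width = (model_ci_high - model_ci_low) / 2
    if abs(result - center) <= 2 * half_width:
        return 15  # Silver: within the interval widened to 2x its width
    return 5       # Red: major disagreement, investigate before calibrating
```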
Putting It Together: EQS Tiers
| Tier | Score Range | Action |
|---|---|---|
| Gold | 75-100 | Calibrate with confidence. This experiment is reliable enough to improve your model. |
| Silver | 50-74 | Use cautiously. Apply only when you have no Gold-tier experiments for that channel. Monitor for degradation. |
| Red | 0-49 | Don't use for calibration. This experiment is more likely to hurt your model than help it. Investigate what went wrong. |
Key insight: In our simulation, experiments scoring Gold (75+) consistently improved model accuracy. Experiments scoring Red (below 50) consistently degraded it. The scoring framework works because it filters on the same dimensions that determined success or failure in our study.
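Combining the four dimension scores into a tier is then straightforward; a minimal sketch:

```python
def eqs_tier(uncertainty: int, alignment: int,
             holdout: int, model_fit: int) -> str:
    """Sum the four 0-25 dimension scores into an EQS tier."""
    total = uncertainty + alignment + holdout + model_fit
    if total >= 75:
        return "Gold"    # calibrate with confidence
    if total >= 50:
        return "Silver"  # use cautiously, only without a Gold option
    return "Red"         # exclude from calibration entirely

# Tight CI, aligned windows, minor leakage, result inside the model's CI:
print(eqs_tier(25, 25, 15, 25))  # -> "Gold" (total 90)
```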
Your Calibration Checklist
Here's the step-by-step process to calibrate your MMM safely:
Step 1: Score Every Experiment
Before any calibration, score every available incrementality experiment using the EQS framework above. Be honest — if you don't know a dimension (e.g., holdout quality wasn't tracked), default to Red for that dimension.
Step 2: Filter by Tier
Gold experiments (75+): Approved for calibration.
Silver experiments (50-74): Keep as backup. Only use if no Gold experiment exists for that channel.
Red experiments (below 50): Exclude entirely. No exceptions.
Step 3: Apply Single Best Strategy
For each channel, select only the single highest-scoring experiment. If you have three Meta experiments scoring 90, 80, and 40, use only the 90.
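A sketch of the Single Best selection, dropping Red-tier experiments and keeping the top scorer per channel (the dict structure is an assumption for illustration):

```python
from collections import defaultdict

def single_best(experiments: list[dict]) -> dict[str, dict]:
    """Keep the single highest-EQS experiment per channel, excluding Red."""
    by_channel: dict[str, list[dict]] = defaultdict(list)
    for exp in experiments:
        if exp["eqs"] >= 50:  # Red-tier experiments are excluded entirely
            by_channel[exp["channel"]].append(exp)
    return {ch: max(exps, key=lambda e: e["eqs"])
            for ch, exps in by_channel.items()}

tests = [{"channel": "meta", "eqs": 90},
         {"channel": "meta", "eqs": 80},
         {"channel": "meta", "eqs": 40}]
print(single_best(tests))  # only the 90-point Meta experiment survives
```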
Step 4: Calibrate and Compare
Run your model twice: once uncalibrated (baseline) and once calibrated with the selected experiments. Compare:
Do the channel contribution shares look more realistic?
Do budget recommendations change in a direction that makes business sense?
Is the model more stable across runs?
Step 5: Monitor Ongoing
Calibration isn't set-and-forget. Re-score experiments quarterly. As you run new experiments, add Gold-tier results and retire older ones. If your model starts showing erratic behavior after calibration, check whether an experiment you trusted has degraded.
Quick reference: Score → Filter → Select best → Calibrate → Compare → Monitor. The full cycle should run quarterly or semi-annually.
Frequently Asked Questions
What is MMM calibration in simple terms?
It's using results from incrementality experiments to make your Marketing Mix Model more accurate. Without calibration, the model only learns from historical patterns. With calibration, it also learns from controlled tests that measure true channel impact.
How much budget impact can calibration have?
On a $10M budget, the difference between an uncalibrated and a well-calibrated model can be $1-2M in misallocated spend per year. Channel attribution accuracy improves 6-16%, which directly translates to better budget decisions.
Should I use every experiment I have?
No, and this is the most important takeaway. In our study, not all experiments were able to capture true channel impact. Using all experiments without applying quality filters made the model worse than having no experiments at all. Use the EQS framework to filter, then apply only the single best experiment per channel.
How do I know if my experiments are good enough?
Score them using the EQS framework in this article. Experiments need a score of 75+ (Gold tier) to calibrate with confidence. The four dimensions (statistical uncertainty, measurement alignment, holdout quality, and model alignment) cover the factors that determined success or failure in our study.
What if I don't have any Gold-tier experiments?
Don't calibrate yet. An uncalibrated model is better than a model calibrated with bad experiments. Focus on improving your experiment design: longer test durations, cleaner holdouts, and aligned measurement windows. One Gold-tier experiment is worth more than ten low-quality experiments.
How often should I recalibrate?
Quarterly or semi-annually. Run experiments in one quarter, validate the results, then use the Gold-tier ones to recalibrate in the following quarter. Markets change, so experiments older than 6-12 months should be retired.
Can bad calibration be worse than no calibration?
Yes. Our study of 1,241 models showed that calibrating with low-quality experiments degraded accuracy by 10-40% — significantly worse than an uncalibrated baseline. The model confidently points to wrong conclusions because it trusts incorrect measurements.
What is the EQS Framework?
The Experiment Quality Score (EQS) is a 100-point framework that evaluates incrementality experiments across four dimensions: statistical uncertainty, measurement alignment, holdout quality, and model alignment. Experiments scoring 75+ (Gold tier) are safe for calibration; below 50 (Red tier) should never be used.
What is the Single Best calibration strategy?
Instead of using all available experiments, select only the single highest-quality experiment per channel for calibration. In our study, this strategy captured nearly all the upside of good experiments (+4% to +37%) while limiting downside from bad ones to just -2% to -5%.
Take the Guesswork Out of Calibration
Scoring experiments, filtering by quality, and calibrating safely — it's a lot to manage manually. Especially when you're running experiments across multiple channels on a rolling basis.
Cassandra's Always-On Incrementality automates this entire pipeline:
Continuous incrementality testing across all channels — no manual experiment setup.
Automated EQS scoring — every experiment is scored on all four dimensions before it touches the model.
Smart calibration — the experiment selection strategy is built in, with automatic quality filtering.
Real-time monitoring — dashboards tracking calibration effect, experiment health, and model accuracy over time.
The result: decision-grade accuracy without the risk of bad experiments degrading your model.
Ready to see what calibrated MMM looks like for your budget?
We'll score your experiments and show you exactly where your current model is misallocating spend.
Methodology: This simulation study used real marketing data from two e-commerce businesses (2020-2025), synthesized 327 MMM datasets with known ground truth, and trained 1,241 models across three calibration scenarios. The controlled environment allowed direct measurement of calibration impact — something impossible with real-world data where true channel effectiveness is unknown.
Author: Gabriele Franco, Founder & CEO of Cassandra