MMM Calibration: What the Industry Won't Tell You About Experiment Quality

The Short Version

We tested whether adding incrementality experiment results to Marketing Mix Models improves budget decisions. Here's what we found across 327 datasets and 1,241 models:

  • Good experiments make your model significantly more accurate — channel attribution improves 6-16%, and budget recommendations get meaningfully closer to optimal.

  • Bad experiments make it worse than having no experiments at all — accuracy degrades 10-40% because the model trusts wrong measurements.

  • The fix is a quality filter — using only the single best experiment per channel (the "Single Best" strategy) captures nearly all the upside while filtering out the noise.

  • We built a scoring framework (EQS) to help you decide which experiments are safe to use for calibration and which ones to throw out.

The conclusion: calibration is essential for decision-grade accuracy. But quality filtering is just as essential.

Why This Matters: Your MMM Might Misallocate Your Capital

Your Marketing Mix Model is only useful if it tells you where to spend your next dollar. If it's wrong by 20%, that's not just a rounding error — it's a budget misallocation.

On a $10M annual ad budget, a 20% attribution error means $2M is going to the wrong channels every year. That's money spent on channels that aren't driving the returns your model says they are.

Most MMMs learn from historical correlations between spend and revenue. The problem: channels that happen to run during high-revenue periods (holidays, product launches) get credited for revenue they didn't drive. Channels with steady-state budgets get undercredited for revenue they did drive.

Incrementality experiments measure what actually changes when you increase or decrease spend. They're the ground truth your model needs.

Calibration bridges the gap: it uses experiment results to force your model to align with measured reality, not just correlations.

Think of it this way: an uncalibrated MMM is GPS without satellite correction. It gets you in the right neighborhood, but it might put you on the wrong street. Calibration is the satellite signal that snaps it to the right address.

But here's the catch: not all experiments are reliable. So we asked: what happens when you calibrate with experiments that got the wrong answer?

What We Tested

We ran the largest controlled simulation of MMM calibration to date. Here's the setup:

  • Starting point: Real marketing data from two e-commerce businesses — a fashion brand (5 channels, Meta-heavy) and a home goods brand (6 channels, balanced spend).

  • Scale: 327 synthetic datasets with KNOWN GROUND TRUTH, so we could measure exactly how accurate each model was.

  • Models trained: 1,241 models across three scenarios.

The three scenarios:

| Scenario | What It Tests | Why It Matters |
| --- | --- | --- |
| No Calibration (Baseline) | Model learns only from historical data | This is what most teams are doing today |
| Good Experiments | Model is calibrated with experiments that correctly measured channel impact | Best case: what happens when your experiments are right? |
| Bad Experiments | Model is calibrated with experiments that got the wrong answer | Worst case: what happens when your experiments are wrong? |

Because we built the synthetic data, we knew the true contribution of each channel, something impossible to observe in the real world. This let us measure exactly how much each scenario helped or hurt.

The Results: What Changes When You Calibrate

With Good Experiments: Everything Gets Better

Compared to an uncalibrated baseline, models calibrated with high-quality experiments delivered:

  • Channel attribution accuracy improved 6-16% — meaning your budget recommendations get meaningfully closer to optimal.

  • Prediction accuracy improved ~44% — the model becomes substantially better at forecasting revenue from different spend levels.

  • Model stability improved ~53% — less variance between model runs, so you can trust the results more.

  • Every single accuracy metric improved. No exceptions.

What this means in practice: For every $100 your uncalibrated model says to spend on a channel, the true optimal might be $84 or $116. With good calibration, that range tightens to $92-$108. Over a $10M budget, that's the difference between $1.6M misallocated and $800K misallocated.
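The arithmetic behind those dollar figures can be sketched in a few lines of Python (illustrative only; the error fractions come from the ranges quoted in this article):

```python
def misallocated_spend(budget, error_fraction):
    """Dollars routed to the wrong channels for a given attribution error."""
    return budget * error_fraction

budget = 10_000_000  # $10M annual ad budget

# Uncalibrated: recommendations can be ~16% away from optimal ($84-$116 per $100)
uncalibrated = misallocated_spend(budget, 0.16)  # ~$1.6M

# Calibrated with good experiments: the range tightens to ~8% ($92-$108)
calibrated = misallocated_spend(budget, 0.08)    # ~$800K

print(f"Uncalibrated: ${uncalibrated:,.0f} | Calibrated: ${calibrated:,.0f}")
```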

With Bad Experiments: It Gets Worse Than No Experiments

Here's the critical finding: calibrating with bad experiments doesn't just fail to help; it actively makes models worse than having no calibration inputs at all.

| What We Measured | No Calibration | Good Experiments | Bad Experiments |
| --- | --- | --- | --- |
| Channel attribution accuracy | Baseline | Better (+6% to +16%) | Worse (-10% to -40%) |
| Prediction accuracy | Baseline | Better (+44%) | Worse (-20%) |
| Model stability | Baseline | Better (+53%) | Worse (-5% to -92%) |
| Budget recommendation quality | Baseline | Significantly better | Significantly worse |

The mechanism is intuitive: when you tell your model to trust incorrect measurements, it overrides what it would have learned from the data on its own. This gives you a model confidently pointing you to the wrong conclusions.

The Risk: When Calibration Backfires

Why do so many experiments get it wrong? Three common reasons:

  1. Insufficient test duration. Short tests (2-4 weeks) often don't run long enough for the effect to stabilize. The measured result captures noise, not signal.

  2. Contaminated holdout groups. If users in the "no ads" group still see your ads through other channels, the measured lift is understated. Even 5% leakage can throw off results.

  3. Measurement window mismatch. If your experiment measures impact over 7 days but your model attributes impact over 30 days, they're measuring different things. The experiment result looks wrong to the model because it is wrong — for the model's purposes.

The danger compounds when you use multiple bad experiments. Each one pulls the model in a different wrong direction. Conflicting bad signals create more damage than a single bad signal.

Our data showed that the "use all experiments" approach was actually the riskiest strategy. When experiment quality is mixed, using everything creates conflicting constraints that degrade accuracy by 10-40%.

How to Score Your Experiments: The EQS Framework

The research makes one thing clear: experiment quality filtering isn't optional. You need a systematic way to decide which experiments are safe to use for calibration.

We built the Experiment Quality Score (EQS) — a 100-point framework that evaluates every experiment across four dimensions before it's allowed to influence your model.

Dimension 1: Statistical Uncertainty (0-25 points)

How much uncertainty is there behind the result?

For calibration, precision beats power. Once an experiment is read out (significant or not), the most useful quality signal is how tight and stable the estimate is. Wide intervals mean the "true" effect could be dramatically different from the point estimate — which makes calibration risky.

| Score | Criteria | What This Means |
| --- | --- | --- |
| 25 (Gold) | Tight estimate: confidence interval (CI) is within ±25% of the point estimate (or tighter), and results are stable under small sensitivity checks (no sign flips) | High-precision signal. Safe to use for calibration. |
| 15 (Silver) | Moderate uncertainty: CI is ±25% to ±75% of the point estimate, and sensitivity checks shift the estimate but don't break it | Directionally useful, but calibration impact should be tempered. |
| 5 (Red) | High uncertainty: CI is wider than ±75% of the point estimate | Too noisy for calibration. High risk of pulling the model in the wrong direction. |
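A minimal sketch of this scoring rule in Python (the function name and signature are ours, not a Cassandra API; the sign-flip stability check is omitted for brevity):

```python
def score_statistical_uncertainty(point, ci_low, ci_high):
    """Score precision: 25 (Gold), 15 (Silver), or 5 (Red), per the table above."""
    if point == 0:
        return 5  # no usable point estimate to compare the CI against
    half_width = (ci_high - ci_low) / 2
    relative = abs(half_width / point)  # CI half-width as a share of the estimate
    if relative <= 0.25:
        return 25  # Gold: CI within +/-25% of the point estimate
    if relative <= 0.75:
        return 15  # Silver: moderate uncertainty
    return 5       # Red: too noisy for calibration

# A measured lift of 2.0 with a CI of [1.6, 2.4] has a +/-20% half-width: Gold
```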

Dimension 2: Measurement Alignment (0-25 points)

Does the experiment measure the same thing your model measures?

This dimension measures the coherence between the experiment's duration and the lag (carryover) effect the MMM estimates for the channel. Your MMM might attribute revenue for a channel over a 14-day lag. If the experiment only lasted 7 days, those numbers are measuring different things, and the mismatch will mislead the model.

The experiment's design and duration need to be coherent with the MMM's insights. If they aren't, either the MMM or the experiment is wrong, and you'll need to redo one of them.

| Score | Criteria | What This Means |
| --- | --- | --- |
| 25 (Gold) | Experiment and model windows within ±3 days | Apples-to-apples comparison. The experiment result maps directly to the model. |
| 15 (Silver) | Windows within ±7 days | Close enough to be useful, but some revenue is missed or double-counted. |
| 5 (Red) | Gap > 7 days, or unknown windows | The experiment and model are measuring different things. Not safe for calibration. |
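The window check is mechanical enough to automate; a sketch under the thresholds above (function name is ours):

```python
def score_measurement_alignment(experiment_days, model_days):
    """Score window coherence: 25 (Gold), 15 (Silver), or 5 (Red), per the table above.
    Pass None when a window is unknown."""
    if experiment_days is None or model_days is None:
        return 5  # unknown windows are treated as Red
    gap = abs(experiment_days - model_days)
    if gap <= 3:
        return 25  # Gold: windows within +/-3 days
    if gap <= 7:
        return 15  # Silver: within +/-7 days
    return 5       # Red: measuring different things

# A 7-day test against a channel the MMM models with a 14-day lag: gap of 7, Silver
```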

Dimension 3: Holdout Quality (0-25 points)

Was the control group truly unexposed?

A clean holdout means the "no ads" group genuinely saw no ads from that channel. In practice, cross-channel leakage, organic reach, and audience overlap can contaminate the control group — understating true impact.

| Score | Criteria | What This Means |
| --- | --- | --- |
| 25 (Gold) | Clean isolation, verified no cross-contamination | The control group was truly unexposed. The measured lift is reliable. |
| 15 (Silver) | Minor leakage (< 5%) | Mostly clean. The result is slightly understated but still directionally correct. |
| 5 (Red) | Known contamination, or unverified holdout | The control group saw ads. The measured lift is unreliable. |

Also verify that the right amount of budget was actually invested in the channel. In practice, implementation errors often mean the test regions don't change spend in line with the experiment design, so check the delivered budget before trusting the result.
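To build intuition for why even small leakage matters, here is a first-order correction under a simplified linear assumption (this is our illustration, not the article's formula): if a fraction of the control group still saw the ads, the measured lift understates the true lift roughly by that fraction.

```python
def adjusted_lift(measured_lift, leakage_fraction):
    """Rough true lift implied by a measured lift when a fraction of the
    control group was still exposed (simplified linear assumption)."""
    if not 0 <= leakage_fraction < 1:
        raise ValueError("leakage_fraction must be in [0, 1)")
    return measured_lift / (1 - leakage_fraction)

# With 5% leakage (Silver tier), a measured +10% lift implies roughly +10.5% true lift
```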

Dimension 4: Model Alignment (0-25 points)

Does the experiment result match what your model expected?

Before calibrating, check: does the experiment's measured return fall within the model's confidence range for that channel? If the model says Meta returns $1.50-$3.00 per dollar and the experiment says $0.40, something is wrong — either with the experiment, the model, or both.

| Score | Criteria | What This Means |
| --- | --- | --- |
| 25 (Gold) | Experiment result falls within the model's confidence interval | The experiment confirms and refines what the model already believes. High-value calibration signal. |
| 15 (Silver) | Result is within 2× the model's confidence interval | Some tension between experiment and model. Use with caution; investigate the gap before calibrating. |
| 5 (Red) | Result is far outside the model's confidence interval | Major disagreement. Either the experiment or the model is significantly wrong. Don't calibrate until you understand why. |
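A sketch of this check in Python. "Within 2× the model's confidence interval" is read here as a band twice as wide around the same midpoint; that reading is our assumption, so adjust it to your own convention:

```python
def score_model_alignment(experiment_roas, model_ci_low, model_ci_high):
    """Score experiment/model agreement: 25 (Gold), 15 (Silver), or 5 (Red)."""
    if model_ci_low <= experiment_roas <= model_ci_high:
        return 25  # Gold: inside the model's confidence interval
    mid = (model_ci_low + model_ci_high) / 2
    half = (model_ci_high - model_ci_low) / 2
    if (mid - 2 * half) <= experiment_roas <= (mid + 2 * half):
        return 15  # Silver: some tension; investigate before calibrating
    return 5       # Red: major disagreement

# Model says Meta returns $1.50-$3.00 per dollar; an experiment measuring $0.40
# falls outside even the doubled band [0.75, 3.75]: Red
```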

Putting It Together: EQS Tiers

| Tier | Score Range | Action |
| --- | --- | --- |
| Gold | 75-100 | Calibrate with confidence. This experiment is reliable enough to improve your model. |
| Silver | 50-74 | Use cautiously. Apply only when you have no Gold-tier experiments for that channel. Monitor for degradation. |
| Red | 0-49 | Don't use for calibration. This experiment is more likely to hurt your model than help it. Investigate what went wrong. |

Key insight: In our simulation, experiments scoring Gold (75+) consistently improved model accuracy. Experiments scoring Red (below 50) consistently degraded it. The scoring framework works because it filters on the same dimensions that determined success or failure in our study.
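Combining the four dimension scores into a tier is a straightforward lookup; a sketch using the thresholds above (function names are ours):

```python
def eqs_total(statistical, measurement, holdout, model_alignment):
    """Sum the four dimension scores (each 5, 15, or 25) into a 0-100 EQS."""
    return statistical + measurement + holdout + model_alignment

def eqs_tier(score):
    """Map a total EQS to its tier, per the table above."""
    if score >= 75:
        return "Gold"
    if score >= 50:
        return "Silver"
    return "Red"

# Three Gold dimensions plus one Silver: 25 + 25 + 25 + 15 = 90, Gold overall
```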

Your Calibration Checklist

Here's the step-by-step process to calibrate your MMM safely:

Step 1: Score Every Experiment

Before any calibration, score every available incrementality experiment using the EQS framework above. Be honest — if you don't know a dimension (e.g., holdout quality wasn't tracked), default to Red for that dimension.

Step 2: Filter by Tier

  • Gold experiments (75+): Approved for calibration.

  • Silver experiments (50-74): Keep as backup. Only use if no Gold experiment exists for that channel.

  • Red experiments (below 50): Exclude entirely. No exceptions.

Step 3: Apply Single Best Strategy

For each channel, select only the single highest-scoring experiment. If you have three Meta experiments scoring 90, 80, and 40, use only the 90.
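The filtering and selection logic of Steps 2 and 3 can be sketched as follows (the `(channel, eqs_score, result)` tuple shape and the example result values are hypothetical, for illustration only):

```python
def single_best(experiments):
    """Keep only the highest-EQS non-Red experiment per channel.
    `experiments` is an iterable of (channel, eqs_score, result) tuples."""
    best = {}
    for channel, score, result in experiments:
        if score < 50:
            continue  # Red tier: excluded entirely, no exceptions
        if channel not in best or score > best[channel][0]:
            best[channel] = (score, result)
    return best

# Three Meta experiments scoring 90, 80, and 40: only the 90 survives
meta_experiments = [("meta", 90, 2.1), ("meta", 80, 1.8), ("meta", 40, 0.4)]
```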

Step 4: Calibrate and Compare

Run your model twice: once uncalibrated (baseline) and once calibrated with the selected experiments. Compare:

  • Do the channel contribution shares look more realistic?

  • Do budget recommendations change in a direction that makes business sense?

  • Is the model more stable across runs?

Step 5: Monitor Ongoing

Calibration isn't set-and-forget. Re-score experiments quarterly. As you run new experiments, add Gold-tier results and retire older ones. If your model starts showing erratic behavior after calibration, check whether an experiment you trusted has degraded.

Quick reference: Score → Filter → Select best → Calibrate → Compare → Monitor. The full cycle should run quarterly or semi-annually.

Frequently Asked Questions

What is MMM calibration in simple terms?

It's using results from incrementality experiments to make your Marketing Mix Model more accurate. Without calibration, the model only learns from historical patterns. With calibration, it also learns from controlled tests that measure true channel impact.

How much budget impact can calibration have?

On a $10M annual budget, the difference between an uncalibrated and a well-calibrated model can be $800K-$2M in misallocated spend per year. Channel attribution accuracy improves 6-16%, which directly translates to better budget decisions.

Should I use every experiment I have?

No, and this is the most important takeaway. In our study, not all experiments were able to capture true channel impact. Using all experiments without applying quality filters made the model worse than having no experiments at all. Use the EQS framework to filter, then apply only the single best experiment per channel.

How do I know if my experiments are good enough?

Score them using the EQS framework in this article. Experiments need a score of 75+ (Gold tier) to calibrate with confidence. The four dimensions (statistical uncertainty, measurement alignment, holdout quality, and model alignment) cover the factors that determined success or failure in our study.

What if I don't have any Gold-tier experiments?

Don't calibrate yet. An uncalibrated model is better than a model calibrated with bad experiments. Focus on improving your experiment design — longer test durations, cleaner holdouts, and aligned measurement windows. One Gold-tier experiment is worth more than ten low-quality experiments.

How often should I recalibrate?

Quarterly or semi-annually. Run experiments in one quarter, validate the results, then use the Gold-tier ones to recalibrate in the following quarter. Markets change, so experiments older than 6-12 months should be retired.

Can bad calibration be worse than no calibration?

Yes. Our study of 1,241 models showed that calibrating with low-quality experiments degraded accuracy by 10-40% — significantly worse than an uncalibrated baseline. The model confidently points to wrong conclusions because it trusts incorrect measurements.

What is the EQS Framework?

The Experiment Quality Score (EQS) is a 100-point framework that evaluates incrementality experiments across four dimensions: statistical uncertainty, measurement alignment, holdout quality, and model alignment. Experiments scoring 75+ (Gold tier) are safe for calibration; below 50 (Red tier) should never be used.

What is the Single Best calibration strategy?

Instead of using all available experiments, select only the single highest-quality experiment per channel for calibration. In our study, this strategy captured nearly all the upside of good experiments (+4% to +37%) while limiting downside from bad ones to just -2% to -5%.

Take the Guesswork Out of Calibration

Scoring experiments, filtering by quality, and calibrating safely — it's a lot to manage manually. Especially when you're running experiments across multiple channels on a rolling basis.

Cassandra's Always-On Incrementality automates this entire pipeline:

  • Continuous incrementality testing across all channels — no manual experiment setup.

  • Automated EQS scoring — every experiment is scored on all four dimensions before it touches the model.

  • Smart calibration — the experiment selection strategy is built in, with automatic quality filtering.

  • Real-time monitoring — dashboards tracking calibration effect, experiment health, and model accuracy over time.

The result: decision-grade accuracy without the risk of bad experiments degrading your model.

Ready to see what calibrated MMM looks like for your budget?

We'll score your experiments and show you exactly where your current model is misallocating spend.

Methodology: This simulation study used real marketing data from two e-commerce businesses (2020-2025), synthesized 327 MMM datasets with known ground truth, and trained 1,241 models across three calibration scenarios. The controlled environment allowed direct measurement of calibration impact — something impossible with real-world data where true channel effectiveness is unknown.

Author: Gabriele Franco, Founder & CEO of Cassandra

Copyright © 2025 – All Rights Reserved
