How to Transform a Broken Experimentation Program
Most organizations believe their experimentation program is working. Many are wrong.
I inherited an experimentation program where stakeholders genuinely believed the A/B testing infrastructure was producing reliable results. There was no obvious crisis—tests ran, results came back, decisions got made. The problem was invisible until you looked closely.
When I audited 2.5 years of historical experiments against basic validity standards, only 6 out of approximately 100 met the bar. A 6% validity rate. The organization had been making product decisions based on noise.
This is more common than you’d think. Here’s how I approached fixing it—and how you can apply the same framework.
Why Experimentation Programs Fail Silently
Experimentation dysfunction rarely announces itself. The symptoms are subtle:
- Tests “complete” but results are noisy or contradictory
- Statistical significance is achieved but effects don’t replicate
- Teams lose faith in experimentation and fall back to HiPPO (highest-paid person's opinion) decisions
- Nobody can point to experiments that definitively changed strategy
The root cause is usually organizational, not technical. Teams adopt experimentation tooling without establishing the standards, processes, and culture to use it effectively. The technology works; the system around it doesn’t.
The Consensus-First Framework
The natural instinct when you suspect a problem is to gather data proving it. This is a mistake. If you show up with evidence that “your experiments are broken,” you’ll trigger defensive reactions. The data becomes debatable, the methodology gets questioned, and nothing changes.
Instead, flip the sequence:
Step 1: Define Standards Before Measuring
Build a comprehensive framework defining what constitutes a reliable experiment:
- Statistical validity requirements
- Proper randomization criteria
- Sample size adequacy
- Metric selection guidelines
- Bias detection methods
- Confound analysis
These aren’t controversial—they’re established principles. Document them clearly.
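To make this less abstract, here is a minimal sketch of what codifying a few of these criteria might look like. The field names, thresholds, and checks are assumptions for illustration (a 50/50 split, a conversion-rate primary metric), not the actual rubric I used:

```python
from dataclasses import dataclass
from scipy import stats

@dataclass
class ExperimentRecord:
    """One historical experiment. Field names are illustrative, not my actual schema."""
    name: str
    users_control: int
    users_treatment: int
    baseline_rate: float          # baseline conversion rate of the primary metric
    min_detectable_effect: float  # absolute lift the team said it cared about
    preregistered_metric: bool    # was the primary metric declared before launch?

def required_sample_per_arm(p: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Two-proportion sample size from the standard normal approximation."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance = p * (1 - p) + (p + mde) * (1 - p - mde)
    return int((z_alpha + z_beta) ** 2 * variance / mde ** 2) + 1

def split_is_balanced(control: int, treatment: int, alpha: float = 0.001) -> bool:
    """Sample ratio mismatch check for a 50/50 allocation (chi-square goodness of fit)."""
    total = control + treatment
    _, p_value = stats.chisquare([control, treatment], [total / 2, total / 2])
    return p_value > alpha

def is_valid(exp: ExperimentRecord) -> bool:
    """An experiment passes only if it was pre-registered, adequately powered,
    and its observed traffic split matches the intended allocation."""
    n_required = required_sample_per_arm(exp.baseline_rate, exp.min_detectable_effect)
    return (
        exp.preregistered_metric
        and min(exp.users_control, exp.users_treatment) >= n_required
        and split_is_balanced(exp.users_control, exp.users_treatment)
    )
```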
Step 2: Get Agreement on the Rubric
Work with your team’s subject matter experts to codify the framework into an evaluation rubric. The key: get everyone to agree on the measurement criteria before applying those criteria to your work.
This is critical. Once stakeholders agree “a valid experiment must have X, Y, and Z,” they can’t later argue with findings that experiments lack X, Y, or Z. You’ve transformed a debate about conclusions into a shared discovery.
Step 3: Apply the Mirror
Now run your historical experiments through the agreed-upon rubric. Present the findings as “here’s what we learned together” rather than “here’s what’s wrong.”
When the data revealed our 6% validity rate, there was no argument. Everyone in the room had already agreed on what “valid” meant. The gap between perception and reality was undeniable—and shared.
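Continuing the sketch above (and assuming the same hypothetical ExperimentRecord and is_valid), the audit itself is mechanical: run every historical experiment through the agreed checks and report the share that passes.

```python
def audit(history: list[ExperimentRecord]) -> float:
    """Apply the agreed rubric to every historical experiment; return the validity rate."""
    passing = [exp for exp in history if is_valid(exp)]
    print(f"{len(passing)} of {len(history)} experiments meet the agreed bar")
    return len(passing) / len(history) if history else 0.0
```

Because every check was agreed on beforehand, a 0.06 coming out of this function reads as a shared discovery rather than one person's accusation.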
Quick Wins Create Momentum
Resist the urge to redesign everything. Large transformations fail when they try to change too much at once.
In my case, one fix—correcting a randomization algorithm flaw—eliminated 90% of our experiment volatility. It took weeks, not months.
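I won't reconstruct the exact flaw here, so treat this as an illustration of the class of fix rather than the one I shipped: a deterministic, salted hash assignment is a common way to make bucketing reproducible and keep assignments independent across experiments.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str) -> str:
    """Deterministic, salted bucketing: a given user always gets the same variant
    within an experiment, and per-experiment salts keep assignments across
    experiments statistically independent."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 1000  # 1,000 fine-grained buckets
    return "control" if bucket < 500 else "treatment"
```

Determinism also makes assignments auditable after the fact, which is exactly what a sample ratio mismatch check like the one in the rubric above relies on.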
This quick win accomplished several things:
- Built credibility - Proved improvement was achievable
- Created breathing room - Reduced the urgent noise so we could think strategically
- Generated allies - People who benefited became advocates for deeper change
Find your equivalent quick win. There’s almost always a single issue causing disproportionate harm.
The Eight Dimensions of Excellence
After stabilizing the immediate problems, I researched best-in-class experimentation programs (Microsoft’s is particularly well-documented) and identified eight dimensions that mature programs excel at:
- Analysis Platform - Statistical rigor and visualization
- Feature Flagging - Safe rollout and targeting infrastructure
- Automation - Self-service experiment launch and monitoring
- Education - Org-wide training on experimental design
- Process - Experiment lifecycle management
- Standards - Naming conventions, documentation requirements, review gates
- Infrastructure - Data pipeline reliability
- Culture - Experiment-first decision-making mindset
You don’t need to be excellent at all eight immediately. Assess where you are, identify the biggest gaps, and build a roadmap.
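One lightweight way to start that assessment is to score each dimension and sort by the gap. The scores below are placeholders, not a recommendation:

```python
# Self-assessment sketch: 1 = ad hoc, 5 = best in class. Scores are placeholders.
maturity = {
    "Analysis Platform": 3,
    "Feature Flagging": 2,
    "Automation": 1,
    "Education": 2,
    "Process": 1,
    "Standards": 1,
    "Infrastructure": 3,
    "Culture": 2,
}

# Biggest gaps first: these become the first milestones on the roadmap.
for dimension, score in sorted(maturity.items(), key=lambda item: item[1]):
    print(f"{dimension}: {score}/5")
```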
Building for Sustainability
The ultimate test of transformation work is whether it survives your departure. If the system depends on you, you haven’t built a system—you’ve built a dependency.
Design for sustainability from the start:
- Document everything - Processes, decisions, rationale
- Hire for ownership - Find someone who can own and evolve the program
- Transfer gradually - Step back deliberately so others step up
- Create feedback loops - The system should self-correct without intervention
I hired an exceptionally capable person to own the program, gave them real authority, and progressively reduced my involvement. The program now runs hundreds of concurrent experiments and hasn’t needed my intervention in years.
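On the feedback-loop point above: one widely used self-correcting guardrail is a scheduled sample ratio mismatch (SRM) check that flags experiments whose traffic split drifts from the configured allocation. A minimal sketch, assuming a 50/50 split and periodic counts pulled from your logging pipeline:

```python
from scipy import stats

def srm_alerts(observed: dict[str, tuple[int, int]], alpha: float = 0.001) -> list[str]:
    """Flag experiments whose observed control/treatment counts no longer match
    a 50/50 allocation. A sample ratio mismatch almost always means broken
    randomization or broken logging, so flagged experiments get paused and investigated."""
    flagged = []
    for name, (control, treatment) in observed.items():
        total = control + treatment
        _, p_value = stats.chisquare([control, treatment], [total / 2, total / 2])
        if p_value <= alpha:
            flagged.append(name)
    return flagged
```

Wired into a scheduler and an alerting channel, this is the kind of check that keeps the program honest without anyone having to remember to look.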
Key Principles
Consensus precedes evidence. When challenging established beliefs, build agreement on measurement criteria before revealing measurements.
Organizational dysfunction masquerades as technical problems. A 6% validity rate wasn’t a technology failure—it was a failure to define and enforce standards. Fixing technology without addressing culture is temporary at best.
Quick wins unlock systematic change. One high-impact fix creates the credibility and space needed for deeper transformation.
Zero-to-one means building systems that don’t need you. The measure of success isn’t what you build—it’s whether it thrives after you leave.
This framework emerged from transforming an experimentation program from 6 valid experiments in two and a half years to hundreds running concurrently. The specific numbers will vary, but the approach transfers.