How to Transform a Broken Experimentation Program
Most organizations believe their experimentation program is working. Many are wrong.
I inherited an experimentation program where stakeholders genuinely believed the A/B testing infrastructure was producing reliable results. There was no obvious crisis—tests ran, results came back, decisions got made. The problem was invisible until you looked closely.
When I audited 2.5 years of historical experiments against basic validity standards, only 6 out of approximately 100 met the bar. A 6% validity rate. The organization had been making product decisions based on noise.
This is more common than you’d think. Here’s how I approached fixing it—and how you can apply the same framework.
Why Experimentation Programs Fail Silently
Experimentation dysfunction rarely announces itself. The symptoms are subtle:
- Tests “complete” but results are noisy or contradictory
- Statistical significance is achieved but effects don’t replicate
- Teams lose faith in experimentation and fall back to HiPPO (highest-paid person's opinion) decisions
- Nobody can point to experiments that definitively changed strategy
The root cause is usually organizational, not technical. Teams adopt experimentation tooling without establishing the standards, processes, and culture to use it effectively. The technology works; the system around it doesn’t.
The Consensus-First Framework
The natural instinct when you suspect a problem is to gather data proving it. This is a mistake. If you show up with evidence that “your experiments are broken,” you’ll trigger defensive reactions. The data becomes debatable, the methodology gets questioned, and nothing changes.
Instead, flip the sequence:
Step 1: Define Standards Before Measuring
Build a comprehensive framework defining what constitutes a reliable experiment:
- Statistical validity requirements
- Proper randomization criteria
- Sample size adequacy
- Metric selection guidelines
- Bias detection methods
- Confound analysis
These aren’t controversial—they’re established principles. Document them clearly.
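To make this less abstract, here is a minimal sketch of what codifying a few of these criteria might look like. The field names, thresholds, and checks are assumptions for illustration (a 50/50 split, a conversion-rate primary metric), not the actual rubric I used:

```python
from dataclasses import dataclass
from scipy import stats

@dataclass
class ExperimentRecord:
    """One historical experiment. Field names are illustrative, not my actual schema."""
    name: str
    users_control: int
    users_treatment: int
    baseline_rate: float          # baseline conversion rate of the primary metric
    min_detectable_effect: float  # absolute lift the team said it cared about
    preregistered_metric: bool    # was the primary metric declared before launch?

def required_sample_per_arm(p: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Two-proportion sample size from the standard normal approximation."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance = p * (1 - p) + (p + mde) * (1 - p - mde)
    return int((z_alpha + z_beta) ** 2 * variance / mde ** 2) + 1

def split_is_balanced(control: int, treatment: int, alpha: float = 0.001) -> bool:
    """Sample ratio mismatch check for a 50/50 allocation (chi-square goodness of fit)."""
    total = control + treatment
    _, p_value = stats.chisquare([control, treatment], [total / 2, total / 2])
    return p_value > alpha

def is_valid(exp: ExperimentRecord) -> bool:
    """An experiment passes only if it was pre-registered, adequately powered,
    and its observed traffic split matches the intended allocation."""
    n_required = required_sample_per_arm(exp.baseline_rate, exp.min_detectable_effect)
    return (
        exp.preregistered_metric
        and min(exp.users_control, exp.users_treatment) >= n_required
        and split_is_balanced(exp.users_control, exp.users_treatment)
    )
```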
Step 2: Get Agreement on the Rubric
Work with your team’s subject matter experts to codify the framework into an evaluation rubric. The key: get everyone to agree on the measurement criteria before applying those criteria to your work.
This is critical. Once stakeholders agree “a valid experiment must have X, Y, and Z,” they can’t later argue with findings that experiments lack X, Y, or Z. You’ve transformed a debate about conclusions into a shared discovery.
Step 3: Apply the Mirror
Now run your historical experiments through the agreed-upon rubric. Present the findings as “here’s what we learned together” rather than “here’s what’s wrong.”
When the data revealed our 6% validity rate, there was no argument. Everyone in the room had already agreed on what “valid” meant. The gap between perception and reality was undeniable—and shared.
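Continuing the sketch above (and assuming the same hypothetical ExperimentRecord and is_valid), the audit itself is mechanical: run every historical experiment through the agreed checks and report the share that passes.

```python
def audit(history: list[ExperimentRecord]) -> float:
    """Apply the agreed rubric to every historical experiment; return the validity rate."""
    passing = [exp for exp in history if is_valid(exp)]
    print(f"{len(passing)} of {len(history)} experiments meet the agreed bar")
    return len(passing) / len(history) if history else 0.0
```

Because every check was agreed on beforehand, a 0.06 coming out of this function reads as a shared discovery rather than one person's accusation.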
Quick Wins Create Momentum
Resist the urge to redesign everything. Large transformations fail when they try to change too much at once.
In my case, one fix—correcting a randomization algorithm flaw—eliminated 90% of our experiment volatility. It took weeks, not months.
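I won't reconstruct the exact flaw here, so treat this as an illustration of the class of fix rather than the one I shipped: a deterministic, salted hash assignment is a common way to make bucketing reproducible and keep assignments independent across experiments.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str) -> str:
    """Deterministic, salted bucketing: a given user always gets the same variant
    within an experiment, and per-experiment salts keep assignments across
    experiments statistically independent."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 1000  # 1,000 fine-grained buckets
    return "control" if bucket < 500 else "treatment"
```

Determinism also makes assignments auditable after the fact, which is exactly what a sample ratio mismatch check like the one in the rubric above relies on.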
This quick win accomplished several things:
- Built credibility - Proved improvement was achievable
- Created breathing room - Reduced the urgent noise so we could think strategically
- Generated allies - People who benefited became advocates for deeper change
Find your equivalent quick win. There’s almost always a single issue causing disproportionate harm.
The Eight Dimensions of Excellence
After stabilizing the immediate problems, I researched best-in-class experimentation programs (Microsoft’s is particularly well-documented) and identified eight dimensions that mature programs excel at:
- Analysis Platform - Statistical rigor and visualization
- Feature Flagging - Safe rollout and targeting infrastructure
- Automation - Self-service experiment launch and monitoring
- Education - Org-wide training on experimental design
- Process - Experiment lifecycle management
- Standards - Naming conventions, documentation requirements, review gates
- Infrastructure - Data pipeline reliability
- Culture - Experiment-first decision-making mindset
You don’t need to be excellent at all eight immediately. Assess where you are, identify the biggest gaps, and build a roadmap.
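One lightweight way to start that assessment is to score each dimension and sort by the gap. The scores below are placeholders, not a recommendation:

```python
# Self-assessment sketch: 1 = ad hoc, 5 = best in class. Scores are placeholders.
maturity = {
    "Analysis Platform": 3,
    "Feature Flagging": 2,
    "Automation": 1,
    "Education": 2,
    "Process": 1,
    "Standards": 1,
    "Infrastructure": 3,
    "Culture": 2,
}

# Biggest gaps first: these become the first milestones on the roadmap.
for dimension, score in sorted(maturity.items(), key=lambda item: item[1]):
    print(f"{dimension}: {score}/5")
```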
Building for Sustainability
The ultimate test of transformation work is whether it survives your departure. If the system depends on you, you haven’t built a system—you’ve built a dependency.
Design for sustainability from the start:
- Document everything - Processes, decisions, rationale
- Hire for ownership - Find someone who can own and evolve the program
- Transfer gradually - Step back deliberately so others step up
- Create feedback loops - The system should self-correct without intervention
I hired an exceptionally capable person to own the program, gave them real authority, and progressively reduced my involvement. The program now runs hundreds of concurrent experiments and hasn’t needed my intervention in years.
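On the feedback-loop point above: one widely used self-correcting guardrail is a scheduled sample ratio mismatch (SRM) check that flags experiments whose traffic split drifts from the configured allocation. A minimal sketch, assuming a 50/50 split and periodic counts pulled from your logging pipeline:

```python
from scipy import stats

def srm_alerts(observed: dict[str, tuple[int, int]], alpha: float = 0.001) -> list[str]:
    """Flag experiments whose observed control/treatment counts no longer match
    a 50/50 allocation. A sample ratio mismatch almost always means broken
    randomization or broken logging, so flagged experiments get paused and investigated."""
    flagged = []
    for name, (control, treatment) in observed.items():
        total = control + treatment
        _, p_value = stats.chisquare([control, treatment], [total / 2, total / 2])
        if p_value <= alpha:
            flagged.append(name)
    return flagged
```

Wired into a scheduler and an alerting channel, this is the kind of check that keeps the program honest without anyone having to remember to look.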
Key Principles
Consensus precedes evidence. When challenging established beliefs, build agreement on measurement criteria before revealing measurements.
Organizational dysfunction masquerades as technical problems. A 6% validity rate wasn’t a technology failure—it was a failure to define and enforce standards. Fixing technology without addressing culture is temporary at best.
Quick wins unlock systematic change. One high-impact fix creates the credibility and space needed for deeper transformation.
Zero-to-one means building systems that don’t need you. The measure of success isn’t what you build—it’s whether it thrives after you leave.
This framework emerged from transforming an experimentation program from 6 valid experiments in two and a half years to hundreds running concurrently. The specific numbers will vary, but the approach transfers.