Riding Out the Messy Middle of A/B Testing: From Hypothesis to Intervention
The “messy middle” of A/B testing—when hypotheses must be shaped into the treatments or interventions that actually go live—demands a blend of data-driven insight and creative problem-solving.
You may recall our exploration of triangulation: combining quantitative, qualitative, and visual data to uncover what users do and why they do it.
But the challenge lies in what comes next: How do you design interventions or treatments that go beyond addressing surface-level issues and instead drive meaningful improvements?
- Understand the nuanced challenges of designing effective interventions
- Be strategic when designing interventions or treatments
- Use proven frameworks like the ALARM protocol
- Don’t fall prey to “best practices”
- Don’t be too quick to jump into solutionizing
- Recognize where AI can outperform humans at coming up with ideas
Glossary:
- Hypothesis: A hypothesis is a testable prediction or proposed solution to a business problem. E.g.: If we reduce the number of form fields on the checkout page, then the completion rate will increase, because users will perceive the process as quicker and easier. The business problem in this case is high form abandonment.
- Intervention: An intervention is any change or action intended to influence outcomes. E.g.: redesigning the checkout page to make it easier for users to complete purchases. It is different from a treatment, which is the specific version or form of the intervention tested within an experiment.
- Test Design: The “how”—the methodology for evaluating interventions. Test design involves creating a structured approach to measure the effectiveness of the intervention.
Why Designing Interventions Is A Challenge
Once you’ve identified conversion roadblocks, the real work begins: Translating those insights into effective interventions. This is where many optimizers encounter the complexities of A/B testing. The “messy middle” is fraught with challenges, and to find out what those are and how frameworks can help, we ran a survey and sourced expert insights.
Survey Methodology
● A survey of 26 CRO professionals from the Women in Experimentation Community and our vetted SME pool forms the foundation of our analysis.
● We gathered additional qualitative insights from 18 experts through Connectively (a platform previously known as HARO) and another 18 from Featured.
This brings our total sample to 62 contributors. While modest in size, this diverse group offers valuable insights into the challenges, strategies, and best practices employed by professionals across the industry.
The Pitfalls of Intervention Design
Our survey uncovered several common challenges in this phase:
- Balancing creative problem-solving with proven methods: 42% of respondents struggle to find the right balance between creative problem-solving and relying on established best practices.
- Dealing with ambiguous data: Despite the abundance of data available, 38% of optimizers find themselves contending with inconclusive or insufficient data when designing interventions.
- Time and resource constraints: 50% of respondents cite time pressure and limited resources as significant obstacles.
Strategies for Effective Intervention Design
With structured processes and frameworks, optimizers can navigate this process with confidence and precision.
- Focus on the quality of the execution because it influences results
Execution quality is critical, even when an intervention is backed by strong data. As Saskia Cook, Director of Consultancy at Conversion, explains:
“Even if you’ve got a strong, data-backed hypothesis, the quality of your execution can be the difference between a winner and a loser. For example, if you hypothesize that applying social proof will increase sales, don’t rush into deciding where or how to implement it without first gathering data to guide those decisions.”
It’s not just about knowing what to change—it’s about understanding how and where to implement those changes to maximize impact.
- Use historical data to learn what’s worked and what hasn’t
A powerful strategy to enhance intervention design is leveraging historical data from past experiments. Saskia elaborates:
“At Conversion, we have a searchable database of thousands of experiments across industries. This allows us to answer questions like, ‘What’s the win rate of social proof experiments on the homepage versus the checkout?’ or ‘What’s the average uplift from experiments with an exit intent overlay?'”
Even without access to such databases, the principle remains: Learn from what has worked (and what hasn’t) in the past to inform your current strategy.
Our sample pool of testers also suggested the following ways to keep execution and subsequent iterations tightly data-guided.
Frameworks for Effective Intervention Design
These frameworks and approaches come straight from the Test Design masterclass, hosted by Deborah O’Malley, Founder of GuessTheTest.
Meet the panel:
- Steph Le Prevost from Conversion
- Gabriela Florea from Verifone
- Marc Uitterhoeve from Dexter Agency
- Will Laurenson from Customers Who Click
- Dave Gowans from Browser to Buyer
Interestingly enough, when we asked our 62 research participants about execution frameworks, only 26.9% said they always used one, 30.8% favored a more flexible approach to deployment, and 35.1% said they treated intervention design on a case-by-case basis.
We’ll share some of the most popular test execution frameworks here:
1. The ALARM Protocol
Many professionals use structured frameworks like the ALARM protocol, developed by Conversion. This comprehensive framework allows experimenters to scrutinize their concepts from every angle, ensuring that hypotheses are tested as effectively as possible.
As Steph Le Prevost from Conversion emphasizes:
“The execution stage of your experiment is often overlooked, rushed or seen as an afterthought but it is always valuable to dissect your execution upfront and think about why your test wouldn’t win.”
The ALARM protocol addresses this challenge by providing a structured way to evaluate and refine experiment concepts before they’re built.
Let’s break down each component of the ALARM framework:
A: Alternative Executions
Consider different ways to test your hypothesis.
In a case study from Conversion, an initial experiment adding a comparison modal to a product category page yielded flat results. By considering alternative executions and enhancing the modal’s visibility, a subsequent test resulted in an 11.7% improvement in revenue.
Expert Tip: As Saskia Cook explains, “If you don’t currently have access to a searchable database of experiments, that shouldn’t stop you. Alternatively, you could draft multiple versions of copy for your experiment and run a quick preference test, with a sample of people from your company who match your target audience.”
L: Loss Factors
Identify at least four reasons why the test might fail and consider how to mitigate these risks. For instance, in a test adding customer ratings to a page, potential loss factors included:
- Users questioning the reliability of the rating based on the number of reviews.
- Users finding one review platform more credible than another.
Address mitigable risks (such as disclosing the number of reviews) and acknowledge unmitigable ones to increase your chances of designing a successful intervention.
A: Audience and Area
Ask whether there’s a better choice of audience or area to maximize the chance of success. As the guide from Conversion states:
“Is there a risk that the change is too early or too late in the journey? Will the execution shrink your audience?”
An example from a vehicle rental website showed how testing the same concept (step-by-step booking instructions) on different pages (homepage vs. location page) can yield significantly different results.
R: Rigor
Take at least two actions to ensure the execution is as robust as possible. As Saskia notes:
“Another example from the ALARM protocol is growing your knowledge of psychological studies and assessing whether learnings can be applied to strengthen your execution.”
Examples of rigorous actions include:
- Referencing an experiment repository for learnings from similar websites.
- Applying relevant psychological principles, such as social proof or the picture superiority effect.
WATCH: Experimentation Rigour: The Secret Sauce of Successful Programs with Haley Carpenter, Founder at Chirpy
You’ll learn:
- What experimentation rigor is, and why it holds the A/B testing flywheel together
- Rigor in research and what that looks like for teams
- Rigor in statistical analysis (and some pitfalls to avoid)
- How you can start to popularize the idea of rigor
M: Minimum Detectable Effect (MDE) and Minimum Viable Experiment (MVE)
Evaluate whether the concept is bold enough to hit the Minimum Detectable Effect while still being a Minimum Viable Experiment. As Conversion’s guide to ALARM explains:
“The balance we need to strike here is to ensure that our experiment is small enough to validate our hypothesis with minimum effort while being bold enough to hit our MDE.”
A simple copy change to a prominently displayed roundel resulted in a 3.55% uplift in transactions.
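One way to pressure-test whether a concept is “bold enough” is to translate the MDE into the sample size it implies for your traffic. Below is a minimal sketch using statsmodels; the baseline conversion rate, relative MDE, and weekly traffic are illustrative assumptions, not figures from Conversion’s case studies.

```python
# Minimal sketch: how an assumed MDE translates into required sample size.
# Baseline rate, relative MDE, and traffic numbers are illustrative only.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05                  # assumed control conversion rate
mde_relative = 0.10                   # smallest relative lift worth detecting
target_rate = baseline_rate * (1 + mde_relative)

effect_size = proportion_effectsize(target_rate, baseline_rate)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)

weekly_visitors_per_variant = 20_000  # assumed 50/50 traffic split
weeks_needed = n_per_variant / weekly_visitors_per_variant
print(f"~{n_per_variant:,.0f} visitors per variant, ~{weeks_needed:.1f} weeks")
```

The smaller the effect you expect, the more traffic you need, which is exactly the tension the ALARM guide describes: the concept must be bold enough to clear your MDE within the traffic you actually have, yet lean enough to remain a minimum viable experiment.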
The ALARM protocol helps experimenters systematically improve the quality of their interventions, potentially leading to higher win rates and more substantial uplifts. As Steph Le Prevost concludes:
“When you come to think about the execution of your experiment, you should not just blindly copy your competitor – you don’t know if what you are seeing is a test, and if you don’t have supporting data for that execution, it likely won’t be a good fit for your users.”
Note: In Convert’s Test Design masterclass, Steph also explained how the ALARM protocol works in conjunction with the Lever Framework at Conversion. The Lever Framework categorizes factors influencing user behavior into five master levers: Cost, Trust, Usability, Comprehension, and Motivation.
As Steph noted: “What it [the framework] is, is the lever that you can pull that you lean on in your experiment to change the user’s behavior.”
Once a lever is identified, the ALARM protocol is applied to refine the execution. While the Lever Framework categorizes factors influencing user behavior into master levers (such as trust, usability, and motivation) and sub-levers (like social proof under trust), the ALARM protocol takes this a step further by guiding how those levers should be operationalized in experiments.
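To make this pre-build review a habit rather than an afterthought, a team could codify the Lever + ALARM prompts as a simple checklist. The sketch below is our illustration, with assumed field names and thresholds, not Conversion’s internal tooling.

```python
# Minimal sketch: a Lever + ALARM pre-build review captured as a checklist.
# Field names, thresholds, and example values are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class AlarmReview:
    hypothesis: str
    lever: str                                                        # master lever being pulled
    alternative_executions: list[str] = field(default_factory=list)   # A: other ways to test it
    loss_factors: list[str] = field(default_factory=list)             # L: aim for at least 4
    audience_and_area: str = ""                                        # A: where and for whom
    rigor_actions: list[str] = field(default_factory=list)            # R: aim for at least 2
    mde_check: str = ""                                                # M: bold enough, still an MVE

    def ready_to_build(self) -> bool:
        """Crude gate: has every ALARM dimension been considered in writing?"""
        return (
            len(self.alternative_executions) >= 2
            and len(self.loss_factors) >= 4
            and bool(self.audience_and_area)
            and len(self.rigor_actions) >= 2
            and bool(self.mde_check)
        )

review = AlarmReview(
    hypothesis="Adding review ratings to the PDP will lift add-to-cart rate",
    lever="trust (sub-lever: social proof)",
    alternative_executions=["stars next to the price", "ratings badge in the gallery"],
    loss_factors=[
        "low review counts undermine credibility",
        "users trust one review platform more than another",
        "ratings distract attention from the CTA",
        "mobile layout pushes the CTA below the fold",
    ],
    audience_and_area="PDP visitors; consider checkout as a later iteration",
    rigor_actions=["check the experiment repository", "apply social-proof research"],
    mde_check="bold enough for the target MDE while staying a minimum viable build",
)
print("Ready to build:", review.ready_to_build())
```

The point is not the code itself but the gate: nothing gets built until every ALARM dimension has been answered in writing.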
Save for later: How to Think like CRO Expert Steph Le Prevost
2. The Dexter Method
The Dexter Method (a trademarked framework by the Dexter Agency) ensures that interventions are effective and systematically implemented.
Marc Uitterhoeve, CEO of Dexter Agency, explains the acronym and what each step is:
D – Data
The first step involves rigorous data collection. Marc shares that “collecting your data” is critical because it forms the foundation for any intervention. This data-driven approach ensures that decisions are based on solid evidence rather than assumptions, making the interventions more likely to succeed.
EX – Execute
Execution is about putting insights into action quickly and efficiently. This step is especially important when there are clear, actionable items that don’t require extensive testing. For example, fixing a bug or responding to clear user feedback—such as improving image quality when users struggle to see product details—can be executed immediately without testing.
T – Test
Testing is reserved for interventions where the outcome is uncertain or you want to validate the impact of changes. Marc shares that while not everything needs to be tested, testing remains a crucial part of ensuring that interventions lead to desired results. This step involves running A/B tests to determine if the changes are effective.
E – Evaluate
After execution and testing, evaluation is key. Marc stresses the importance of having a consistent process for evaluation to ensure that the results are analyzed correctly.
R – Repeat
The final step in the Dexter Method is to repeat the process. Marc points out that CRO is an iterative process—each round of data collection, execution, testing, and evaluation builds on the last. By continually repeating this cycle, you can refine your strategies and drive ongoing improvements.
The Dexter Method was created to provide a clear, repeatable process that could be easily followed by Marc’s team at Dexter Agency. The method’s structure allows teams to consistently produce high-quality interventions, ensuring that every step from data collection to execution is methodically handled.
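As a rough illustration of the EXecute-versus-Test decision at the heart of the loop, here is a hypothetical sketch; the routing rule and field names are ours, not Dexter Agency’s.

```python
# Minimal sketch of the Dexter loop's routing step (Data -> EXecute or Test).
# The rule and field names are illustrative, not Dexter Agency's tooling.
from dataclasses import dataclass

@dataclass
class Insight:
    description: str
    outcome_is_certain: bool   # e.g. a bug fix or unambiguous user feedback

def route(insight: Insight) -> str:
    """Ship clear fixes directly; send uncertain changes to an A/B test."""
    return "execute immediately" if insight.outcome_is_certain else "run an A/B test"

backlog = [
    Insight("broken coupon field on mobile checkout", outcome_is_certain=True),
    Insight("reorder PDP sections to surface reviews", outcome_is_certain=False),
]

for insight in backlog:
    print(f"{insight.description} -> {route(insight)}")
# Evaluate the results with a consistent process, then feed the learnings
# back into the next round of data collection (Repeat).
```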
Also read: How to Think Like CRO Expert Marc Uitterhoeve
3. The Localization Approach
Don’t look at global best practices when designing interventions. Tailor the experience to different markets and cultures.
Varying test execution parameters can impact test results for the same core hypothesis.
Gabriela Florea, CRO Manager at Verifone, explains this approach with an example. When merchants use Verifone, experts work with them to optimize the checkout page.
And one thing is clear: User behaviors, preferences, and expectations can vary significantly across regions, affecting conversion rates and overall business performance.
Gabriela emphasizes, “Focus on testing it on different markets and different regions, rather than just looking at some standard best practices online and just applying it to your business.”
Consider how different product types and price points affect user behavior in various regions. Gabriela notes, “If you’re purchasing a product that is $20, $30, you’re in for a very fast experience. Whereas if you have a product that entitles $200, $300, $400, you are definitely going to want a review page.”
You also have to adapt payment options to regional preferences. Gabriela explains, “If you are in the US, you’re going to prefer to see Visa and Amex and some other payment methods. But if you go to Asia Pacific, they will have different wallets, different cards.” When designing a test, also remember to use culturally appropriate language. Gabriela found that literal translations often don’t resonate with local users, affecting user trust and willingness to complete purchases.
Gabriela also explains that QA and the use of VPNs are important for these country-level test designs. And if you have access to user research, go heavy on user testing.
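To make the localization approach concrete, here is a minimal, hypothetical sketch of a checkout test parameterized per region instead of hard-coding one “best practice” variant. The regions, payment methods, and price thresholds are illustrative assumptions, not Verifone’s configuration.

```python
# Minimal sketch: per-region parameters for a localized checkout treatment.
# Regions, payment methods, and thresholds are illustrative assumptions.
REGION_PARAMS = {
    "US": {
        "payment_methods": ["visa", "amex", "paypal"],
        "show_review_page_above": 150.0,   # USD: add a review step for larger carts
        "copy_locale": "en-US",
    },
    "APAC": {
        "payment_methods": ["alipay", "grabpay", "local_wallets"],
        "show_review_page_above": 100.0,
        "copy_locale": "per-market, human-reviewed translation",
    },
}

def checkout_variant(region: str, cart_value: float) -> dict:
    """Return the treatment parameters for a visitor's region and cart size."""
    params = REGION_PARAMS[region]
    return {
        "payment_methods": params["payment_methods"],
        "include_review_page": cart_value >= params["show_review_page_above"],
        "copy_locale": params["copy_locale"],
    }

print(checkout_variant("US", cart_value=300.0))
print(checkout_variant("APAC", cart_value=25.0))
```

Keeping the regional parameters in one place also makes the per-market QA and VPN checks Gabriela mentions easier to script.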
4. AI-Assisted Prioritization Approach
This approach uses machine learning to predict the outcome of experiments based on historical data. Steph explains:
“We have Confidence AI. (…) Every test is tagged with these meta tags. And what that allows us access to is what we refer to as a second brain.”
The process involves:
- Tagging historical tests with consistent metadata.
- Feeding this historical data into a machine learning model.
- Using the model to predict outcomes of new test ideas.
Steph describes its effectiveness:
“It’s then taking the input of the new experiments in our backlog and those tags and predicting the outcome of them. And it is currently correct 69% of the time when it predicts a winner.”
This method is used primarily for test prioritization:
“It’s really useful for prioritization, especially early on in a program when you’re trying to get that buy-in, you’re trying to get winners out the door as quickly as possible, you can start to prioritize tests that you have a better chance of winning.”
But Steph also notes its limitations: “I don’t see this being 100% accurate. And the reason is that human behavior is not predictable.”
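Conversion hasn’t published how Confidence AI is built, but the general pattern Steph describes (consistent meta tags plus a supervised model trained on past outcomes) can be sketched with off-the-shelf tools. The tags, labels, and model choice below are illustrative assumptions, not the actual system.

```python
# Minimal sketch of metadata-based outcome prediction for test prioritization.
# This is NOT Conversion's Confidence AI; it is a generic illustration using
# scikit-learn on hypothetical meta tags and win/loss labels.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical tagged history: each past experiment as a dict of meta tags.
history = [
    {"page": "checkout", "lever": "trust", "device": "mobile"},
    {"page": "homepage", "lever": "usability", "device": "desktop"},
    {"page": "pdp", "lever": "motivation", "device": "mobile"},
    {"page": "checkout", "lever": "usability", "device": "desktop"},
]
outcomes = [1, 0, 1, 0]  # 1 = winner, 0 = flat/loser

model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression())
model.fit(history, outcomes)

# Score the backlog and prioritize ideas with the highest predicted win odds.
backlog = [
    {"page": "checkout", "lever": "trust", "device": "desktop"},
    {"page": "homepage", "lever": "motivation", "device": "mobile"},
]
for idea, p_win in zip(backlog, model.predict_proba(backlog)[:, 1]):
    print(idea, f"predicted win probability: {p_win:.0%}")
```

In practice you would need a far larger tagged history than this toy example, and, as Steph cautions, the predictions guide prioritization rather than replace the experiment itself.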
5. The Custom Template Approach
Running multiple experiments demands a streamlined approach, and that’s where templates and best practices make a difference. They don’t just save time—they ensure every experiment is set up with the same level of rigor and attention to detail.
Will Laurenson, CEO and Lead Consultant at Customers Who Click, shared how his team uses templates to standardize processes: “We’ve set up Notion boards where someone can follow a templated audit. It allows us to quickly go in and check the most important things.” This approach reduces the risk of oversight, ensuring that critical elements are always addressed.
But following a template shouldn’t mean blindly following best practices. The real power of templates comes from their adaptability. They provide a strong foundation, but the best teams know when to customize them. As Will put it, “You still need to think about it and say, does this actually apply to our products? Is this going to make sense for what we sell?” Adaptation is crucial. While a template might suggest a certain strategy, you should always tailor it to your audience’s specific needs. Effective teams treat templates as living documents, continuously refining them based on new insights from ongoing experiments.
As Will explains, blindly implementing best practices instead of actually testing stems from a lack of knowledge and a desire for quick wins.
According to our survey, 52% of optimizers rely on internal brainstorming and creative thinking as integral parts of their process. The key is finding a balance—using data to validate and refine creative ideas, not constrain them.
Shiva Manjunath, host of the experimentation podcast ‘From A to B’ and an experimentation manager at Motive who is passionate about learning-focused experimentation, captures this approach well:
“First-party data. Always. Talk to customers, gather both qualitative and quantitative research for your site.”
The Pros and Cons of Using Proven Execution Patterns
Leveraging proven execution patterns, such as those mentioned in Ron Kohavi’s LinkedIn post about Evidoo, offers advantages and challenges. Using established patterns can lead to quicker wins, particularly when you lack the time or resources to test new hypotheses from scratch.
For instance, Evidoo provides a rich set of patterns already validated through thousands of experiments, making them an attractive starting point for new A/B tests. But there are downsides to relying too heavily on these patterns. As Ron points out, many experiments are focused on specific transactions (e.g., purchases), which might not always align with broader business goals like overall customer experience. And applying these patterns without proper customization can lead to suboptimal results if they don’t fit the specific context of your audience or product.
Stuti Kathuria’s LinkedIn post shares an example where she adjusted the placement of the add-to-cart CTA on a product detail page (PDP) to improve conversion rates. While the adjustment was based on proven strategies, the key to success was her ability to recognize the specific needs of the audience and customize the implementation. Trying to ape strategies from posts like these is not a good idea.
And Shiva Manjunath explains why:
“I have heavy skepticism towards ‘established CRO patterns.’ Assume it’s based on statistically powered and accurate data (which I doubt, in many cases). The patterns are based on data that aren’t YOUR customers, nor run on YOUR site. Data unique and specific to your customers and solving problems they face, are always going to yield better results than trying to test random solutions other websites have had success with.
Would you rather take a generic multivitamin that is not unique to your body? I would prefer to solve problems unique to my body, in the same way we should prefer first-party data over generic solutions.
You may have success with ‘best practices.’ You WILL have success with solving for your user’s actual problems using first-party data on your customers.”
When to Use Proven Patterns
Proven patterns can be incredibly useful when employed strategically:
- Use Them as a Foundation: Especially if you’re new to experimentation or need a quick win, these patterns give you a solid starting point.
- Customize and Adapt: Always adapt these patterns to fit the specific context of your product, audience, and market. This ensures the strategy remains relevant and effective.
- Test and Iterate: Even with proven patterns, testing is essential. What worked for others might not work for your audience. Use these patterns as a starting point, but be ready to iterate based on your unique results.
Context Is Everything When Designing Tests
While data and frameworks like the Lever Framework or Dexter Method provide valuable guidance, understanding the specific context of the user or the market is crucial for designing effective interventions.
In the Test Design masterclass, Dave Gowans, CEO of Browser to Buyer, argues that data alone isn’t enough. The context in which the data is applied plays a significant role in the success of any experiment.
For example, a tactic that works well in one market might fail in another due to cultural differences, purchasing behaviors, or even the competitive landscape. We must tailor experiments to the specific environment they’re being applied to rather than rely solely on generalized data or best practices.
You also have to iterate based on contextual feedback. Feedback loops should be integrated into every stage of experimentation. By continuously gathering and analyzing feedback within the specific context of the test, teams can make real-time adjustments that improve the chances of success.
While data is essential, Dave explains that sometimes the best decisions come from a deep understanding of the market and user behaviors that might not be immediately obvious from the data alone. This balance is particularly important in new or rapidly changing markets where historical data may be limited or less relevant.
Implications for Experimentation
Dave’s insights suggest that while frameworks and data are foundational, the real art of CRO lies in the ability to adapt these tools to the specific context in which they are applied. Experimentation should be seen as a dynamic process—one that requires continuous learning, adaptation, and iteration based on the unique circumstances of each test.
In practice, this means that teams should:
- Prioritize contextual understanding as much as data analysis.
- Use frameworks as flexible guides rather than rigid rules.
- Adapt strategies and tactics based on real-time feedback and local market conditions.
- Balance data-driven insights with intuitive knowledge of the market and user behavior.
We Have A Problem: Solution-Izing (But AI Can Help)
“Solution-izing,” as discussed by Craig and Marcella Sullivan and Iqbal Ali, is the tendency to jump to solutions without fully understanding the problem.
Craig shares, “One big common problem for product managers and designers… is that because you’re so close to it, you think… we know how to fix that.” This rush to solutions often stems from a superficial understanding of the problem, leading to quick fixes rather than addressing the root causes.
Iqbal further explains that humans bring a lot of “baggage and bias” into the creative process, which can limit their ability to think outside the box. This is where the danger of solution-izing becomes apparent. By narrowing their focus too quickly, teams may miss out on more innovative or effective ideas that could be explored with a deeper understanding of the problem.
How AI Helps Combat Solution-izing
AI plays a crucial role in addressing the pitfalls of solution-izing by encouraging a more comprehensive approach to problem-solving. As Craig puts it, “AI helps… because what you can see at that point is actually, I don’t really want to think about solutions until I’ve asked AI to give me the widest possible fields of solutions that I do not have in my brain.”
Here’s how AI contributes:
- Diversity of Ideas: AI significantly increases the diversity of ideas generated during the ideation process. Iqbal mentions that in workshops, people might start with a few ideas, but once they incorporate AI, “they come up with five or six ideas.” This additional input from AI helps teams consider a broader range of possibilities, preventing them from prematurely settling on a single solution.
- Bias Reduction: Craig and Iqbal drive home how AI can challenge human biases. Craig notes that AI can provide “perspectives on the problem that [humans] would normally, through their biases, not consider.” By doing so, AI helps teams break free from conventional thinking, opening up new avenues for creative problem-solving.
- Enhanced Problem Understanding: AI aids in thoroughly understanding the problem before moving to solutions. Craig stresses the importance of spending more time on problem identification, saying, “AI can actually help people to dig through the problems and actually flesh out this 360-degree view.” This deeper analysis ensures that the solutions developed are more aligned with the true nature of the problem.
- Acceptance of Ideas: Interestingly, AI-generated ideas often face less resistance from teams. Craig observed that “people are prepared to take ideas from AI that they would not take from a colleague because there’s politics involved.” This openness to AI-driven suggestions can lead to more innovative solutions being adopted.
Get the FREE Convert AI playbook — practical, research-backed strategies designed for CRO, UX, and Product teams.
With over 3,000 hours of research and 500+ hours of real-world testing, this playbook will supercharge your workflow and help you use AI with confidence. No spam, no follow-ups—just pure insights.
Bonus
When You Survive The Messy Middle: Successful Interventions Lead To Real Results
We asked experts from around the world to share examples of their most successful interventions, and here’s what they said:
Adding trust signals to checkout page
We realized that the screen where we asked users to enter their payment details was causing a significant drop-off. By simplifying the form and adding trust signals, we saw a 12% increase in completed purchases.
Carmen Albisteanu (Tale), CRO at Salt Bank
Fixing low user engagement on blog with visuals
We faced an issue with low engagement on our blog. After implementing an intervention focused on creating more visually appealing and relevant content, we saw a 30% increase in engagement within the first two weeks.
Doug Haines, CRO at MarketingPros
Simplifying the home page
A homepage redesign resulted in a 6.06% increase in engagement by simplifying the user interface and focusing on the most crucial elements that drive conversions.
Sanne Maach Abrahamsson, Digital Team Lead, CRO & UX, at Lomax A/S
Redesigning website navigation
Leveraging data analysis and customer insights (interviews, surveys, and user testing), we mapped out and tested new on-site navigation for our B2B client. We used all of our data to map out which pages their ICP needs to go through, cares about, and wants to see in order to make a decision and start a free trial. We understood what types of features they care about and what case studies they want to read, and all of this helped us create a navigation menu that catered to their needs. We ran the experiment across the entire site, and it increased ICP signups by 6% and engagement with the menu by 30%.
Talia Wolf, CEO at Getuplift
Redesigning forms for better UX
We ran an A/B test with our web forms to see what worked best. We compared a short form with fewer fields to a longer one with more detailed questions. The shorter form led to more completed submissions, showing that users prefer simpler forms.
Sharat Potharaju, Co-founder & CEO, Uniqode
Introducing a two-step form
One of our most famous tests for a SaaS client was the “two-step form,” in which we moved away from a form with many open fields and instead replaced them across key touchpoints with a single email field. Revealing only the email field at first lowered the barrier to entry, so we increased form starts. Completes also increased, likely because the form had already been started (investment cost) and the process felt smooth and low effort.
Natalie Thomas, Director of Strategy at The Good
Reorganizing PDP to increase add-to-cart rate
A successful intervention significantly impacted our A/B test results for an e-commerce client. By reorganizing product information into easily scannable sections and adding a quick-view feature for product specs, we saw a 28% increase in add-to-cart rate and a 15% decrease in bounce rate.
Chris Kirksey, CEO, Direction.com
Rebranding listicle page to resonate with audience
We took a leap and rebranded our top software listicle page with a more provocative headline, ‘AP That Pays for Itself.’ This approach led to a significant boost in conversions, proving that sometimes you have to challenge the status quo to uncover what truly resonates with your audience.
Updating color scheme
The impact of color: we tried changing a darker green paywall to a lighter paywall. All I can say is: always opt for anything lighter, people smiling, or green over red!
Hanna Grevelius, CPO @ Golf GameBook