Seven pitfalls to avoid when running controlled experiments on the web

TLDR

Controlled experiments (A/B tests) have influenced fields such as medicine, agriculture, manufacturing, and advertising, yet the practical aspects of running them online are still being developed. This paper, a follow-on to an earlier survey (Kohavi, et al., 2009), discusses pitfalls the authors encountered after running numerous online controlled experiments at Microsoft. The pitfalls range from statistical assumptions, such as assuming that the common formulas for standard deviation and statistical power can be applied, to ignoring robots in analysis, a problem unique to online settings. The authors also found that when a Treatment is ramped up gradually, Simpson's paradox makes it easy to incorrectly identify the winning Treatment.

Abstract

Controlled experiments, also called randomized experiments and A/B tests, have had a profound influence on multiple fields, including medicine, agriculture, manufacturing, and advertising. While the theoretical aspects of offline controlled experiments have been well studied and documented, the practical aspects of running them in online settings, such as web sites and services, are still being developed. As the usage of controlled experiments grows in these online settings, it is becoming more important to understand the opportunities and pitfalls one might face when using them in practice. A survey of online controlled experiments and lessons learned were previously documented in Controlled Experiments on the Web: Survey and Practical Guide (Kohavi, et al., 2009). In this follow-on paper, we focus on pitfalls we have seen after running numerous experiments at Microsoft. The pitfalls include a wide range of topics, such as assuming that common statistical formulas used to calculate standard deviation and statistical power can be applied and ignoring robots in analysis (a problem unique to online settings). Online experiments allow for techniques like gradual ramp-up of treatments to avoid the possibility of exposing many customers to a bad (e.g., buggy) Treatment. With that ability, we discovered that it's easy to incorrectly identify the winning Treatment because of Simpson's paradox.
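
The interaction between gradual ramp-up and Simpson's paradox mentioned above can be sketched in a few lines of Python. This is a minimal illustration only: the two-day schedule (Treatment at 1% of traffic, then 50%), the user and conversion counts, and the names `DayCounts`, `DAYS`, `pooled`, and `treatment_wins_every_day` are all assumptions made for the example and do not come from the abstract. The sketch shows a Treatment that has the higher conversion rate on each day yet the lower rate once the days are pooled, because most of its traffic arrived on the day when rates were low for everyone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DayCounts:
    """Users exposed to a variant and how many of them converted."""
    users: int
    conversions: int

    @property
    def rate(self) -> float:
        return self.conversions / self.users


# Hypothetical counts, chosen only to make the effect visible.
# Day 1: Treatment ramped to 1% of traffic. Day 2: ramped to 50%.
DAYS = {
    "day 1 (1% ramp)": {
        "control": DayCounts(users=990_000, conversions=20_000),
        "treatment": DayCounts(users=10_000, conversions=230),
    },
    "day 2 (50% ramp)": {
        "control": DayCounts(users=500_000, conversions=5_000),
        "treatment": DayCounts(users=500_000, conversions=6_000),
    },
}


def pooled(variant: str) -> DayCounts:
    """Naively combine all days for one variant, ignoring the ramp schedule."""
    users = sum(day[variant].users for day in DAYS.values())
    conversions = sum(day[variant].conversions for day in DAYS.values())
    return DayCounts(users, conversions)


def treatment_wins_every_day() -> bool:
    """True if Treatment has the higher conversion rate on each day."""
    return all(
        day["treatment"].rate > day["control"].rate for day in DAYS.values()
    )


if __name__ == "__main__":
    for name, day in DAYS.items():
        print(f"{name}: control={day['control'].rate:.4f} "
              f"treatment={day['treatment'].rate:.4f}")
    print(f"pooled: control={pooled('control').rate:.4f} "
          f"treatment={pooled('treatment').rate:.4f}")
```

One way to avoid the reversal in a setting like this is to compare variants only within periods that share the same traffic split, rather than pooling raw counts across ramp-up stages.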
