Publication | Closed Access
How A/B Tests Could Go Wrong
Citations: 43
References: 17
Year: 2019
Unknown Venue
Engineering, Root Cause, Verification, Diagnosis, Software Engineering, On-line Testing, Data Science, A/B Testing, Bias, Test Automation, Experimental Economics, Experimental Testing, Testability, Could Go Wrong, Statistics, Reliability, Testing Technique, Computer Science, Online Experiments, Test Management, Software Testing, Massive Growth
Online experiments have grown massively at Internet companies, yet A/B tests can easily go wrong when users are inexperienced or the testing platform has little governance. The authors argue for an intelligent A/B platform that democratizes testing and supports quality decisions through built-in detection and diagnosis of invalid tests. They mined historical A/B tests to identify the most common causes of invalidity, such as biased design, self-selection bias, and generalizing results beyond the experiment population and time frame, and they built scalable algorithms that automatically detect invalid tests and diagnose their root causes. Invalid tests lead to non-optimal business decisions; surfacing invalidity improved decision quality, educated users, and reduced problematic designs over time.
We have seen massive growth of online experiments at Internet companies. Although conceptually simple, A/B tests can easily go wrong in the hands of inexperienced users and on an A/B testing platform with little governance. An invalid A/B test hurts the business by leading to non-optimal decisions. Therefore, it is now more important than ever to create an intelligent A/B platform that democratizes A/B testing and allows everyone to make quality decisions through built-in detection and diagnosis of invalid tests. In this paper, we share how we mined historical A/B tests and identified the most common causes of invalid tests, ranging from biased design and self-selection bias to attempts to generalize A/B test results beyond the experiment population and time frame. Furthermore, we developed scalable algorithms to automatically detect invalid A/B tests and diagnose the root cause of invalidity. Surfacing invalidity not only improved decision quality, but also served as user education and reduced problematic experiment designs in the long run.
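The abstract does not describe the detection algorithms themselves. As a rough illustration of what an automated validity check can look like, the sketch below tests for a sample ratio mismatch: it compares observed arm sizes against the intended traffic split with a chi-square goodness-of-fit test. This is a commonly used check for biased assignment, not necessarily the authors' method; the function name `srm_check`, its parameters, and the `alpha` threshold are all assumptions made for this example.

```python
import math


def srm_check(control_n, treatment_n, expected_treatment_share=0.5, alpha=0.001):
    """Flag a possible sample ratio mismatch between two experiment arms.

    Hypothetical helper for illustration only; not taken from the paper.
    Runs a chi-square goodness-of-fit test with one degree of freedom.
    """
    if not 0 < expected_treatment_share < 1:
        raise ValueError("expected_treatment_share must be strictly between 0 and 1")
    total = control_n + treatment_n
    if total <= 0:
        raise ValueError("at least one user must be assigned")

    # Arm sizes we would expect if assignment followed the intended split.
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment

    # Chi-square statistic over the two arms.
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)

    # For one degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))

    return {"chi2": chi2, "p_value": p_value, "mismatch": p_value < alpha}


if __name__ == "__main__":
    # Intended 50/50 split, but the treatment arm received noticeably more users.
    print(srm_check(5000, 5600))
```

A flagged mismatch only indicates that assignment or logging deviated from the design; diagnosing the root cause, which the paper also addresses, requires further investigation. The strict `alpha` is a design choice for this sketch, intended to limit false alarms when many experiments are checked.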