A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
published: Oct. 9, 2017, recorded: August 2017, views: 1
Report a problem or upload filesIf you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested in web sites, mobile applications, desktop applications, services, and operating systems. One of the key challenges for organizations that run controlled experiments is to come up with the right set of metrics. Having good metrics, however, is not enough. In our experience of running thousands of experiments with many teams across Microsoft, we observed again and again how incorrect interpretations of metric movements may lead to wrong conclusions about the experiment’s outcome, which if deployed could hurt the business by millions of dollars. Inspired by Steven Goodman’s twelve p-value misconceptions , in this paper, we share twelve common metric interpretation pitfalls which we observed repeatedly in our experiments. We illustrate each pitfall with a puzzling example from a real experiment, and describe processes, metric design principles, and guidelines that can be used to detect and avoid the pitfall. With this paper, we aim to increase the experimenters’ awareness of metric interpretation issues, leading to improved quality and trustworthiness of experiment results and better data-driven decisions.
Link this pageWould you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !