In many controlled experiments, including online controlled experiments (a.k.a. A/B tests), the result of interest, and hence the inference made, is about the relative difference between the control and treatment group. In A/B testing as part of conversion rate optimization, and in marketing experiments in general, we use the term "percent lift" ("percentage lift"), while in most other contexts terms like "relative difference", "relative improvement", "relative change", "percent effect", "percent change", etc. are used, as opposed to "absolute difference", "absolute change", and so on.

In many cases, claims about the relative difference in performance between two groups based on statistical significance and confidence intervals are made using calculations intended only for inferences about the absolute difference. This leads to reporting nominal significance and confidence levels that correspond to lower uncertainty than there actually is, resulting in false certainty and overconfidence in the data. In this article I will examine how big the issue is and provide proper calculations for p-values and confidence intervals around estimates of percent change.

Why standard confidence intervals and p-values should not be used for percent change

Say, for example, that we have conducted a simple fixed sample size experiment with a superiority alternative hypothesis (H_0: δ ≤ 0, H_A: δ > 0) with the following outcome:

Control (A) & treatment (B) group observations: 1360 each
Control event rate (conversion rate) P_A: 0.10 (10%)
Treatment event rate (conversion rate) P_B: 0.12 (12%)

The result is statistically significant at the 0.05 level (95% confidence level), with a p-value for the absolute difference of 0.049 and a confidence interval for the absolute difference of [0.0003, 0.0397] (pardon the difference in notation on the screenshot: "Baseline" corresponds to control (A), and "Variant A" corresponds to treatment (B)).

The question is: can one then claim that they have a statistically significant outcome, or a confidence interval that excludes a 0% relative difference? Put in terms of confidence intervals, can one simply convert the 0.0003 and 0.0397 bounds to relative ones by dividing them by the baseline conversion rate? This would result in a (0.003, 0.397) confidence interval in relative terms, or, in percentages, a (0.3%, 39.7%) interval for the percent effect. Can one claim that such an interval was generated by a procedure which would, with probability 95%, produce an interval that covers the true relative difference?

The above process results in an inaccurate estimate of the type I error probability (α), meaning that the nominal error level does not match the actual error level. As a result, one would proceed to act on a false sense of security, since the uncertainty of the inference is greater than it is believed to be. The reason for this is simple: the statistic we are calculating the p-value and confidence interval for is the absolute difference, δ_abs = P_B − P_A, while the claims are about the relative difference, δ_rel = (P_B − P_A) / P_A, or the percentage change, δ_relPct = (P_B − P_A) / P_A × 100. The division by P_A adds more variance to δ_rel and δ_relPct, so there is no simple correspondence between a p-value or confidence interval calculated for the absolute difference and one for the relative difference (between proportions or means).
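To make the naive procedure concrete, here is a minimal sketch in Python, assuming a standard unpooled z-test for the difference of two proportions. The tool behind the screenshot may use a slightly different variance estimator, so the figures below are approximate reproductions of the quoted values rather than its exact output:

```python
# Minimal sketch of the naive procedure, assuming a standard unpooled
# z-test for the difference of two proportions.
from scipy.stats import norm

n_a = n_b = 1360
p_a, p_b = 0.10, 0.12

# Standard error of the absolute difference (unpooled)
se_abs = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5

# One-sided p-value for H_0: delta <= 0 vs H_A: delta > 0
z_stat = (p_b - p_a) / se_abs
p_value = norm.sf(z_stat)                        # ~0.048, near the quoted 0.049

# Bounds matching a one-sided test at the 0.05 level (z* ~ 1.645)
z_crit = norm.ppf(0.95)
ci_abs = ((p_b - p_a) - z_crit * se_abs,
          (p_b - p_a) + z_crit * se_abs)         # ~(0.0003, 0.0397)

# Naive extrapolation: divide the absolute bounds by the baseline rate
ci_rel_naive = tuple(bound / p_a for bound in ci_abs)  # ~(0.003, 0.397), i.e. 0.3% to 39.7%
print(p_value, ci_abs, ci_rel_naive)
```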

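Before looking at the proper result for this example, here is a minimal sketch of one standard construction of a confidence interval for percent change: the delta method applied to log(P_B / P_A) (the Katz log interval for a risk ratio). This is an assumed stand-in on my part, not necessarily the exact method behind the figures quoted next, so it yields bounds close to, but not identical with, them:

```python
# Minimal sketch of one proper construction: the delta method applied
# to log(P_B / P_A). Assumed stand-in for the article's own calculation;
# bounds come out close to, but not identical with, the quoted ones.
import math
from scipy.stats import norm

n_a = n_b = 1360
p_a, p_b = 0.10, 0.12

log_ratio = math.log(p_b / p_a)
# Delta-method standard error of log(p_b / p_a)
se_log = math.sqrt((1 - p_a) / (n_a * p_a) + (1 - p_b) / (n_b * p_b))

z_crit = norm.ppf(0.95)  # same one-sided 95% setup as above
lo_pct = (math.exp(log_ratio - z_crit * se_log) - 1) * 100
hi_pct = (math.exp(log_ratio + z_crit * se_log) - 1) * 100
# Noticeably wider than the naive 0.3% to 39.7% extrapolation
print(f"percent change CI: ({lo_pct:.1f}%, {hi_pct:.1f}%)")
```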
The proper confidence interval in this case spans from -0.5% to 43.1% percent change, which covers the "no change" value of 0%, while the proper p-value is 0.0539, meaning that the result is not statistically significant at the 0.05 significance threshold.

In order to understand how big the issue is, I've conducted 8 million simulations with 80 different combinations (100,000 sims each) of baseline event rates, effect sizes, and confidence levels, comparing the performance of proper confidence intervals for percent change (% lift) and the approach described above: a naive extrapolation of confidence intervals for the absolute difference to ones about relative change. The goals were to understand how much worse the latter is than the former, how big the discrepancies between the nominal and actual coverage and type I error are, and what factors affect them the most. It turned out that there are two factors that affect how badly the naive extrapolation from absolute to relative difference performs: the size of the true relative difference, and the confidence level. In what follows, RelCI denotes intervals for the relative difference, while AbsCI denotes naive extrapolations to relative difference of intervals for the absolute difference; 95% and 97.5% are the confidence levels for which the intervals were constructed (nominal level).
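A heavily scaled-down sketch of such a coverage simulation is below: a single illustrative combination (n = 1360 per group, 10% baseline, a true lift of 20%, 95% nominal two-sided level) and 10,000 runs instead of 100,000. All parameter values here are assumptions for illustration, not the article's grid:

```python
# Scaled-down coverage simulation: estimate how often the naive AbsCI
# and a delta-method RelCI actually cover the true relative difference.
import math
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, p_a_true, lift = 1360, 0.10, 0.20       # true relative difference: +20%
p_b_true = p_a_true * (1 + lift)
z = norm.ppf(0.975)                         # 95% two-sided nominal level
runs = 10_000
abs_cover = rel_cover = 0

for _ in range(runs):
    pa = rng.binomial(n, p_a_true) / n
    pb = rng.binomial(n, p_b_true) / n
    if pa == 0 or pb == 0:
        continue  # degenerate sample; counts as a miss
    # AbsCI: absolute-difference interval divided by the observed baseline
    se_abs = math.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
    lo = ((pb - pa) - z * se_abs) / pa
    hi = ((pb - pa) + z * se_abs) / pa
    abs_cover += lo <= lift <= hi
    # RelCI: delta-method interval on log(pb / pa), transformed back
    se_log = math.sqrt((1 - pa) / (n * pa) + (1 - pb) / (n * pb))
    lr = math.log(pb / pa)
    rel_cover += (math.exp(lr - z * se_log) - 1) <= lift <= (math.exp(lr + z * se_log) - 1)

print(f"AbsCI coverage: {abs_cover / runs:.3f}")
print(f"RelCI coverage: {rel_cover / runs:.3f}")
```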
