When running an A/B test, we measure a confidence level that indicates how likely it is that the observed difference between variants reflects a real effect rather than random noise. Reaching the 95% confidence threshold is the conventional bar for calling a result statistically significant.
Sometimes your A/B test will reach a 95% confidence threshold, then drop below 95% later in the test.
For a thorough overview of A/B testing, see this resource: https://www.alexbirkett.com/ab-testing/
Regarding this specific scenario: as item 7 in that article explains, this behavior is completely normal:
> As it turns out, an A/B test can dip below a .05 p-value (the commonly used rule to determine statistical significance) at many points during the test, and at the end of it all, sometimes it can turn out inconclusive. That's just the nature of the game.
Here are some potential reasons:
- Variability in the Data: A/B test results can be quite volatile, especially when the sample size is small. It's possible that during the early stages of the test, the observed difference was due to chance, and as more data was collected, the true difference (if any) became clearer.
- External Factors: External events could influence the behavior of users. For example, a major holiday, sale, or event could skew results temporarily while you're testing changes on an e-commerce site.
- Sampling Bias: If there's any change in the type or behavior of users during the test (e.g., a change in traffic source, a product release, or a marketing campaign), this shift can influence results.
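You can see the first point (volatility under repeated peeking) for yourself with a small simulation. This is a minimal sketch, not a production analysis tool: it runs an A/A test (both arms share the same true conversion rate, so there is no real effect) and checks a hand-rolled two-proportion z-test at regular intervals. Any interim check that reads as "significant" is pure noise, which is exactly how a test can cross the 95% threshold and later fall back below it.

```python
import math
import random

random.seed(42)

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    if n_a == 0 or n_b == 0:
        return 1.0
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via erf)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# A/A test: both arms convert at a true 5% rate, so any
# "significant" interim reading is a false positive.
TRUE_RATE = 0.05
conv_a = conv_b = n_a = n_b = 0
significant_checks = 0
total_checks = 0

for visitor in range(1, 20001):
    if visitor % 2:  # alternate assignment between the two arms
        n_a += 1
        conv_a += random.random() < TRUE_RATE
    else:
        n_b += 1
        conv_b += random.random() < TRUE_RATE
    if visitor % 500 == 0:  # peek at the test periodically
        total_checks += 1
        if two_proportion_p_value(conv_a, n_a, conv_b, n_b) < 0.05:
            significant_checks += 1

print(f"Interim checks that looked significant: {significant_checks}/{total_checks}")
```

The more often you peek at an ongoing test, the more chances noise has to produce a transiently "significant" reading, which is why a result that crosses 95% early can fade as data accumulates.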
What should you do in this situation?
- Review Test Setup: Ensure that there have been no technical issues or changes during the test that could have affected the results. Make sure the groups are still being split correctly, and no outside influences have been introduced.
- Consider External Factors: Were there any external events or factors that could have influenced the results? Understanding these can help in interpreting the drop in significance.
- Increase Sample Size: If the sample size is small, continue running the test to collect more data. A larger sample size can give more accurate and stable results.
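To decide how much longer to run the test, you can estimate the sample size needed up front. Below is a rough sketch using the standard approximation for a two-proportion z-test at 95% confidence and 80% power (the z-values 1.96 and 0.84 correspond to those choices); the function name and parameters are illustrative, not from any particular library.

```python
import math

def required_sample_size(base_rate, relative_lift, z_alpha=1.96, z_power=0.84):
    """Approximate visitors needed per arm to detect a relative lift,
    using the standard two-proportion z-test power approximation
    (defaults: 95% confidence, 80% power)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Example: a 5% baseline conversion rate and a hoped-for 10% relative lift
print(required_sample_size(0.05, 0.10))
```

Note how sensitive the answer is to the expected lift: halving the lift roughly quadruples the required sample, which is why small effects need tests that run far longer than intuition suggests.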