The significance score we provide can be interpreted as the confidence we have that a result is significant, in other words, how sure you can be of your conclusions.
In statistics, a result with a significance score above 95% is usually considered statistically significant. This means the change you are seeing is unlikely to be due to chance alone, and it is reasonable to conclude that the difference comes from the change you are testing.
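To make the 95% threshold concrete, here is a minimal sketch of one common way such a score can be computed: a two-proportion z-test comparing conversion rates between two variants. The variant figures are made up, and the use of this particular test is an assumption for illustration; the score you see in the dashboard may be computed differently.

```python
# Hypothetical illustration: a two-proportion z-test on conversion rates.
# The variant numbers below are invented for the example.
from math import sqrt
from statistics import NormalDist

# (conversions, tracked searches) for variants A and B -- hypothetical figures
conv_a, n_a = 320, 4_000
conv_b, n_b = 380, 4_000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                  # pooled conversion rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))    # standard error of the difference
z = (p_b - p_a) / se

# Two-sided p-value; a "95% significance score" roughly corresponds to p < 0.05
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p-value = {p_value:.3f}, significant = {p_value < 0.05}")
```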
Important Caveat
We only control certain parts of the incoming data, which means the statistical significance we present is not definitive.
We rely on a few things:
An even distribution of searches across userTokens
We assume that your searches are evenly distributed among your users, i.e. each userToken makes roughly the same number of searches on average. In some situations, such as when a back-end server sends many requests with the same userToken, the results may be skewed.
How to check it: look at the number of tracked searches. Does it match the expected split of the test?
In the example above, the first variant had 10,000 searches from 100 userTokens, which suggests that many searches are being sent from the same userToken, since an average of 100 searches per user is unlikely. Below it, we see 1,000 searches for 100 users, an average of 10 searches per user, which suggests a more even distribution. A result like the first one suggests a single source is generating searches and should be excluded from the test.
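As a rough sketch of this check, the snippet below counts searches per userToken from a hypothetical export of your own search logs and flags tokens sending far more searches than average. The log format, field names, and threshold are assumptions; adapt them to your own data.

```python
# Sketch: compute searches per userToken and flag heavy senders.
# `search_logs` is a hypothetical list of (userToken, query) records
# exported from your own logs or analytics.
from collections import Counter

search_logs = [
    ("user-1", "shoes"), ("user-1", "boots"), ("user-2", "hats"),
    ("backend-job", "sku-123"), ("backend-job", "sku-456"),  # same token, many searches
    # ... more records
]

searches_per_token = Counter(token for token, _ in search_logs)
average = sum(searches_per_token.values()) / len(searches_per_token)
print(f"average searches per userToken: {average:.1f}")

# Flag tokens sending far more searches than average (threshold is arbitrary)
for token, count in searches_per_token.most_common():
    if count > 10 * average:
        print(f"suspicious token: {token} ({count} searches)")
```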
A consistent test period
Any change to your index configuration during the A/B test can distort your results. This is not something we can detect when computing the significance score.
Problems with your events recording
If there is an issue with the recording of your events, it will have an impact on your A/B test results. Make sure you follow the steps to validate your events.
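As one rough illustration, assuming you can export a daily count of recorded events, the sketch below scans the test period for days with no events, which often points to a recording problem. The data structure is hypothetical and not part of any official tooling.

```python
# Sketch: look for days with zero recorded events during the test period.
# `daily_event_counts` is a hypothetical export built from your own event logs.
from datetime import date, timedelta

daily_event_counts = {
    date(2024, 5, 1): 1_240,
    date(2024, 5, 2): 1_180,
    # date(2024, 5, 3) missing -- a gap in recording
    date(2024, 5, 4): 1_205,
}

start, end = min(daily_event_counts), max(daily_event_counts)
day = start
while day <= end:
    if daily_event_counts.get(day, 0) == 0:
        print(f"{day}: no events recorded -- investigate before trusting the test")
    day += timedelta(days=1)
```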