The significance score we provide can be interpreted as the confidence we have that a result is significant.
In statistics, a result with a significance score above 95% is usually considered statistically significant. That means the change you are seeing is unlikely enough to be due to chance alone that it is reasonable to conclude the difference is caused by the change you are testing.
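To make the idea concrete, here is a minimal sketch of how such a significance score could be computed. It assumes a two-sided two-proportion z-test on conversion counts; the actual method behind the score described above may differ, and all names here are illustrative.

```python
# Hypothetical sketch: deriving a significance score from conversion
# counts with a two-proportion z-test (assumption, not the exact
# method used by the product).
from math import sqrt, erf

def significance_score(conv_a, n_a, conv_b, n_b):
    """Return the confidence (0-100) that variants A and B truly differ."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled proportion under the null hypothesis (no real difference).
    p = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    z = abs(p_a - p_b) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(z / sqrt(2)))
    # Two-sided confidence that the observed difference is real.
    return (1 - 2 * (1 - cdf)) * 100

# 12% vs 16% conversion over 1,000 searches each clears the 95% bar.
print(round(significance_score(120, 1000, 160, 1000), 1))
```

With identical conversion rates the score is 0, and it rises toward 100 as the gap between variants grows relative to the sample size.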
We only control certain parts of the incoming data, which means the statistical significance we present is not definitive.
It relies on a few assumptions:
An even distribution of searches vs userTokens
We assume that your searches are evenly distributed among your users, i.e. that each userToken makes roughly the same number of searches on average. In some situations, for example when a back-end server sends many requests with the same userToken, the results can be significantly skewed.
How to check it: look at the number of tracked searches. Does it match the expected split of the test?
In the example above, the first variant received 10,000 searches from 100 userTokens, which suggests that many searches are coming from the same userToken: 1,000 searches per user is unlikely. Below it, we see 1,000 searches from 100 users, an average of 10 searches per user, which suggests a much more even distribution. A result like the first one suggests a single source is generating searches and must be excluded from the test.
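The check above can be sketched in a few lines. This assumes you can access your raw search events and that each one carries a userToken field; the event shape and names here are illustrative, not a real API.

```python
# Hypothetical sketch: detecting userToken skew in search events.
# The event structure is an assumption for illustration.
from collections import Counter

def searches_per_user(events):
    """Return (total searches, distinct userTokens, largest token's share)."""
    counts = Counter(e["userToken"] for e in events)
    total = sum(counts.values())
    top_share = max(counts.values()) / total
    return total, len(counts), top_share

# A back-end server re-using one userToken dominates the traffic:
events = [{"userToken": "server-1"}] * 900 + [
    {"userToken": f"user-{i}"} for i in range(100)
]
total, users, top_share = searches_per_user(events)
print(total, users, round(top_share, 2))  # 1000 101 0.9
```

A single token accounting for 90% of all searches, as here, is a strong sign of a non-human source that should be filtered out before trusting the test results.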
A consistent test period
Any change to your index configuration during an A/B test can distort your results. This is not something we can detect when computing the significance score.
Problems with your events recording
If there is an issue with how your events are recorded, it will have a knock-on effect on your A/B test results. Make sure you follow the steps to validate your events.