A/B test results don't hold in hold-out dataset

Question

I'm working on a project that analyzes data from a two-player game. Each game ends in either a win or a loss, and players can choose from different initial positions. My objective is to analyze the first player's winning rate for different starting positions, i.e., is there any initial position that gives the first player an advantage?

To do so, I have split my data into two sets: train and test. Then I run an A/B test on the training data, where group A is the set of games played from one starting position and group B is the set of games played from another. I want to test whether the winning rate for B is higher than the winning rate for A.
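A minimal sketch of such a comparison, assuming a one-sided pooled two-proportion z-test (one common choice for comparing win rates; the exact test may differ). The function name and the counts in the usage lines are hypothetical and are not my actual data.

```python
import math


def one_sided_two_proportion_ztest(wins_a, n_a, wins_b, n_b):
    """Test H0: p_B <= p_A against H1: p_B > p_A using a pooled z-test.

    wins_a, wins_b: number of first-player wins in each group.
    n_a, n_b: number of games played in each group.
    Returns (z statistic, one-sided p-value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one game")

    p_a = wins_a / n_a
    p_b = wins_b / n_b

    # Pooled win rate under the null hypothesis of equal rates.
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero (all wins or all losses)")

    z = (p_b - p_a) / se
    # Upper-tail probability of the standard normal: P(Z > z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z_train, p_train = one_sided_two_proportion_ztest(500, 1000, 600, 1000)
z_test, p_test = one_sided_two_proportion_ztest(50, 100, 60, 100)
print(f"train: z={z_train:.3f}, p={p_train:.3g}")
print(f"test:  z={z_test:.3f}, p={p_test:.3g}")
```

The same function would be applied separately to the win/game counts from the training set and from the hold-out set.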

What I've seen is that initial position B is significantly better than initial position A (p-value < $10^{-7}$). However, when I run the same A/B test on the hold-out dataset, the result is not significant (p-value > $0.05$).

Do you have any idea why this could happen? Have you ever seen a case where the results of an A/B test didn't hold on a different slice of the data?

Thanks!
