
I have a dataset of the following form:

System A Rating    System B Rating
4.5                5
3                  4
5                  3
etc.               etc.

I have 155 such data points gathered using a survey. Which statistical test should I use to determine whether one system has significantly better ratings than the other?

Thanks.

hH1sG0n3
Michael Pulis
  • I would use the average value as the rating and the standard deviation as the reliability of the rating. – Ubikuity Jun 08 '21 at 19:15
  • Do you mean that the rows correspond to different items, and for each item you have a rating for system A and system B? Do you have a gold standard indicating what the ratings should be? If not I don't see how you can determine which system is better. – Erwan Jun 08 '21 at 22:22
  • Each row corresponds to a user who has rated both system A and system B. – Michael Pulis Jun 09 '21 at 08:31
  • But do you have a way to determine which system is "better"? Normally you need a gold standard (also called ground truth) for that. Otherwise you just have two different predictions, but you don't know whether they are good or bad. – Erwan Jun 09 '21 at 16:59
  • They are not predictions. They are user ratings from two different systems. Each row represents a user and how they rated the recommendations from system A and from system B. I want to know which statistical test is best for establishing statistical significance, such as a paired t-test or the Wilcoxon signed-rank test. – Michael Pulis Jun 09 '21 at 19:06
  • @MichaelPulis Oh OK, I didn't understand the question. Then I think a paired Wilcoxon test is what you need here. Student's t-test cannot be used unless you know that the distribution is normal, which is unlikely. With this kind of result I also like to plot the two distributions of scores; sometimes there's a clear difference. – Erwan Jun 14 '21 at 18:12
  • Those sound like sensible suggestions. Thanks – Michael Pulis Jun 15 '21 at 11:19
  • Is there any kind of pairing between the rows? – Dave Jun 15 '21 at 12:04
  • The row represents one user, who has rated a list of recommendations. From this list, there were actually two systems at play, System A and System B. The two values in the row represent the extracted average for the recommendations made by both systems, for the same user. – Michael Pulis Jun 15 '21 at 12:17
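The paired Wilcoxon test suggested in the comments above can be sketched with SciPy's `scipy.stats.wilcoxon`. This is only a minimal illustration: the helper name `paired_wilcoxon` and the simulated ratings are assumptions made for the example, not the asker's data or code.

```python
import numpy as np
from scipy import stats


def paired_wilcoxon(ratings_a, ratings_b):
    """Two-sided Wilcoxon signed-rank test on paired ratings.

    Hypothetical helper: element i of each input must come from the same user.
    """
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("paired samples must have the same length")
    # The test works on the per-user differences a - b.
    statistic, p_value = stats.wilcoxon(a, b)
    return statistic, p_value


if __name__ == "__main__":
    # Simulated placeholder ratings (NOT the survey data from the question).
    rng = np.random.default_rng(0)
    system_a = rng.normal(3.5, 1.0, size=155)
    system_b = system_a + 0.5 + rng.normal(0.0, 0.3, size=155)
    statistic, p_value = paired_wilcoxon(system_a, system_b)
    print(f"W = {statistic:.1f}, p = {p_value:.3g}")
```

A small p-value would indicate that the per-user differences are not centred on zero. Plotting the two score distributions, as the comment also suggests, is a useful complement to the test.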

2 Answers


You need to perform some tests to identify the appropriate statistical measure for comparing the two distributions accurately.

  1. For each group/system, run a normality test to make sure that you are not dealing with an exotic distribution to which the central limit theorem does not apply (very unlikely).

  2. Calculate the variance for each group, in order to see whether you can use tests that make the assumption of equal variances. To statistically test if the variances can be assumed to be equal, you can perform a Levene's test.

If the variances are equal, you can go ahead with an independent (unpaired) t-test; otherwise, you should rely on Welch's t-test. You can also perform these tests online, e.g. here: https://www.graphpad.com/quickcalcs/ttest1.cfm
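The procedure in this answer could be sketched in Python roughly as follows. Several choices here are assumptions made for illustration: the answer does not name a specific normality test, so Shapiro-Wilk is used; the helper name `compare_unpaired` and the 0.05 threshold are hypothetical. Note that this treats the two columns as independent samples, as the answer does, whereas the comments on the question describe the rows as paired by user.

```python
import numpy as np
from scipy import stats


def compare_unpaired(ratings_a, ratings_b, alpha=0.05):
    """Normality check, Levene's test, then Student's or Welch's t-test.

    Hypothetical helper; alpha is an assumed significance threshold.
    """
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)

    # Step 1: normality test per group (Shapiro-Wilk is an assumed choice).
    _, normality_p_a = stats.shapiro(a)
    _, normality_p_b = stats.shapiro(b)

    # Step 2: Levene's test for equality of variances.
    _, levene_p = stats.levene(a, b)
    equal_var = bool(levene_p >= alpha)

    # Step 3: Student's t-test if variances look equal, otherwise Welch's.
    t_stat, t_p = stats.ttest_ind(a, b, equal_var=equal_var)

    return {
        "normality_p": (float(normality_p_a), float(normality_p_b)),
        "levene_p": float(levene_p),
        "test": "student" if equal_var else "welch",
        "t_statistic": float(t_stat),
        "p_value": float(t_p),
    }
```

Calling `compare_unpaired(system_a, system_b)` on the two rating columns returns the intermediate p-values alongside the final result, so each step of the decision can be inspected.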

hH1sG0n3

The Mann-Whitney U test (Wilcoxon rank-sum test) lets us compare the ratings of System A and System B. It compares the distributions of the two populations and indicates whether the two samples differ. If the test finds no evidence that the distributions differ, the null hypothesis is not rejected. If one distribution differs from the other, the two systems have a significant difference in observed ratings.
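As a rough sketch, the test can be run with SciPy's `scipy.stats.mannwhitneyu`. The wrapper name `mann_whitney` is hypothetical, and the two-sided alternative is an assumption, since the answer does not specify a direction.

```python
import numpy as np
from scipy import stats


def mann_whitney(ratings_a, ratings_b):
    """Two-sided Mann-Whitney U test treating the samples as independent.

    Hypothetical helper; the two-sided alternative is an assumed choice.
    """
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)
    u_statistic, p_value = stats.mannwhitneyu(a, b, alternative="two-sided")
    return float(u_statistic), float(p_value)
```

A p-value below the chosen significance level would lead to rejecting the null hypothesis that the two rating distributions are the same.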

Subhash C. Davar