
I have a set of 10 experiments, each of which computes precision, recall, and f1-score. Average precision and average recall are easy to compute, but I have some confusion regarding the average f1-score.

There are two ways I could compute the mean f1-score:

  1. Take the f1-scores of each of the 10 experiments and compute their average.
  2. Take the average precision and average recall, then compute the f1-score using the formula f1 = 2*p*r/(p+r). (A small sketch of both options follows below.)
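
For concreteness, here is a minimal Python sketch (with made-up precision and recall values, purely for illustration) showing that the two options generally give different numbers:

    def f1(p, r):
        # harmonic mean of precision and recall
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    # hypothetical per-experiment precision/recall values
    precisions = [0.9, 0.5, 0.7]
    recalls = [0.3, 0.8, 0.6]

    # Option 1: average the per-experiment f1-scores
    option1 = sum(f1(p, r) for p, r in zip(precisions, recalls)) / len(precisions)

    # Option 2: f1-score of the average precision and average recall
    avg_p = sum(precisions) / len(precisions)
    avg_r = sum(recalls) / len(recalls)
    option2 = f1(avg_p, avg_r)

    print(option1, option2)  # ~0.570 vs. ~0.626 -- the two disagree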

I could not find any strong reference to support either argument. The closest document I could find is this: https://www.kaggle.com/wiki/MeanFScore

Can anyone explain, with a reference if possible, which of the methods is correct and why?

EDIT: One of the members suggested this source. However, I still doubt the reliability of the source. I have seen people not using the method explained there in their research publications. (I would even be using it in one of my publications.) I would expect some more opinions from the community to verify this idea.

Pinkesh Badjatiya

3 Answers


The paper Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement by Forman and Scholz discusses the different methods for computing the average F-score in cross-validation. It shows that under very high class imbalance, some of the computation methods (the average of the individual folds' F-scores, or the F-score based on the average of the individual folds' precision and recall) can lead to biased results. The paper recommends computing the F-score by summing the TP, FP, and FN counts across folds, computing precision and recall from these totals, and finally the F-score.
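
A minimal sketch of that recommendation, assuming hypothetical per-fold (TP, FP, FN) counts: pool the raw counts across all folds first, then compute precision, recall, and the F-score once from the totals.

    # hypothetical (tp, fp, fn) counts per fold, for illustration only
    folds = [(8, 2, 5), (3, 1, 9), (6, 4, 2)]

    # pool the raw counts across all folds
    tp = sum(f[0] for f in folds)
    fp = sum(f[1] for f in folds)
    fn = sum(f[2] for f in folds)

    # compute precision/recall/F-score once from the pooled totals
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    print(f_score)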

tiagotvv

As mentioned by other users, the solution is not very clear-cut. The general approach is to follow what is mentioned here.

Also, as suggested by a senior R&D employee who is also my mentor, the method used in practice is to calculate the average f1-score as the harmonic mean of the average precision and the average recall.

This surely depends on your use case, as well as on how you are calculating the metric (micro/macro).
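
For reference, scikit-learn exposes this micro/macro distinction directly; a small sketch with made-up labels (assuming scikit-learn is installed):

    from sklearn.metrics import f1_score

    # made-up multi-class labels, for illustration only
    y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
    y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 2]

    # macro: unweighted mean of the per-class f1-scores
    print(f1_score(y_true, y_pred, average="macro"))
    # micro: f1 computed from TP/FP/FN pooled across classes
    print(f1_score(y_true, y_pred, average="micro"))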

Pinkesh Badjatiya

As you observed, one can argue for either of your definitions. It is most important that you document what you mean by "mean F1-score". You should also consider which of the two options provides a more meaningful evaluation; this depends on your specific application or task.

In my opinion, "mean F1-score" clearly means that you calculate the mean of the individual F1-scores. In some situations, option 2 can be described as the overall F1-score; this depends on what you are aggregating. "F1-score of the mean precision and recall" might be a good general description of option 2.

Joachim Wagner