I'm developing a Python library for confidence intervals for common accuracy metrics, with both analytic and bootstrap computations.
Following this paper, I implemented the analytic confidence intervals for the Macro and Micro F1 scores.
However, the derivation for the binary F1 score (commonly used, and the default mode in sklearn) is missing there, as are the Macro Recall and Precision cases.
After a lot of searching I couldn't find an analytic expression of the binary F1 score confidence interval / variance anywhere, so it looks like the way to go is to derive it according to the spirit of the paper above, using the multivariate delta method.
I would be grateful if anyone could comment on the correctness of this derivation, if I made an error somewhere, and any general comments you might have.
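- The f1 score is $ \frac {2p_{11} }{2p_{11} + p_{01} + p_{10}} = \frac {2p_{11} }{d} $, where $p = (p_{00}, p_{01}, p_{10}, p_{11})$ are the cell probabilities of the 2x2 confusion matrix.
- Use the delta method to get the asymptotic variance of the estimated f1, where $\widehat{f1}$ is the sample estimate and $f1$ is its true value: $$ \sqrt n \, (\widehat{f1} - f1) \xrightarrow{d} N\left(0, \frac{\partial f1}{\partial p} ^ T [\operatorname{diag}(p) - pp^T] \frac{\partial f1}{\partial p} \right)$$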
- Compute the derivative $\frac{\partial f1}{\partial p}$: $$\frac{\partial f1}{\partial p_{00}} = 0$$ $$\frac{\partial f1}{\partial p_{10}} = \frac{\partial f1}{\partial p_{01}} = -2\frac {p_{11}} {d^2} = -\frac {f1} {d}$$ $$\frac{\partial f1}{\partial p_{11}} = \frac 2 {d} - \frac {4 p_{11}} {d^2} = \frac {2(1-f1)} {d} $$
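- Now we can plug the derivative vector of length 4 into the delta-method expression above. The estimate of the asymptotic variance, i.e. the variance of $\sqrt n$ times the f1 estimate, will be $ \frac{\partial f1}{\partial p} ^ T [diag(p) - pp^T] \frac{\partial f1}{\partial p} $ evaluated at the observed proportions, so the variance of the f1 estimate itself is this quantity divided by $n$.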
Finally, the variance is computed as the product row vector x matrix x column vector.
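To make the steps above concrete, here is a minimal NumPy sketch of the computation. The function names, the count ordering `(tn, fp, fn, tp)`, and the normal-approximation interval with `z = 1.96` are illustrative choices of mine, not something taken from the paper. Since f1 is symmetric in $p_{01}$ and $p_{10}$, it does not matter which of the two is the false positive cell. If I simplify the quadratic form by hand, I get $\frac{f1(1-f1)(2-f1)}{d}$, because $\sum_i p_i \frac{\partial f1}{\partial p_i} = 0$ and so the $pp^T$ term vanishes; the second helper only exists to cross-check that simplification numerically.

```python
import numpy as np


def binary_f1_delta_ci(tn, fp, fn, tp, z=1.96):
    """Delta-method standard error and normal-approximation CI for binary F1.

    Hypothetical helper for illustration; inputs are confusion-matrix counts.
    """
    counts = np.array([tn, fp, fn, tp], dtype=float)
    n = counts.sum()
    p = counts / n  # estimated cell probabilities (p00, p01, p10, p11)
    _, p01, p10, p11 = p

    d = 2 * p11 + p01 + p10
    if d == 0:
        raise ValueError("F1 is undefined when tp = fp = fn = 0")
    f1 = 2 * p11 / d

    # Gradient of f1 with respect to (p00, p01, p10, p11)
    grad = np.array([0.0, -f1 / d, -f1 / d, 2 * (1 - f1) / d])

    # Multinomial covariance of sqrt(n) * p_hat: diag(p) - p p^T
    cov = np.diag(p) - np.outer(p, p)

    asym_var = grad @ cov @ grad  # row vector x matrix x column vector
    se = np.sqrt(asym_var / n)    # standard error of the f1 estimate itself
    return f1, se, (f1 - z * se, f1 + z * se)


def closed_form_asym_var(f1, d):
    """Hand-simplified asymptotic variance, used only as a cross-check."""
    return f1 * (1 - f1) * (2 - f1) / d


# Example with made-up counts
f1, se, ci = binary_f1_delta_ci(tn=50, fp=10, fn=5, tp=35)
print(f1, se, ci)
```

The interval is not clipped to [0, 1], and the normal approximation may be poor for small samples or when f1 is close to 0 or 1, which is where I would expect the bootstrap version to be more reliable.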
I'm planning to do the same for the Macro Recall and Precision cases, but I first want to make sure the derivation for the simpler binary case makes sense.
Thank you!