How do I make inferences about test metrics for the entire population from sample metrics?

Question

Generally we calculate specific metrics for ML models on a test set (and we try to make that test set representative). I'm not clear on how to make inferences about the same metrics for the population that the test set represents - i.e., say I want to answer: if the model were run on the whole population, what is the confidence interval for the metric in question at (e.g.) the 95% confidence level?

Now for a simple case I can try to use my basic stats knowledge: suppose I have a binary classification model and I'm interested in reporting its precision.

1. I measure the precision on the test set and define the test statistic as the sample proportion $\hat p$ of correctly classified examples out of total examples.
2. I also run the model on different folds of data to get the precision on each fold, and then calculate the standard deviation of those per-fold precision values - call it $\bar\sigma$. This is my proxy for the standard deviation of the sampling distribution of the proportion.
3. Alternatively, I can compute a standard deviation for each fold $i$ as $\sigma_i=\sqrt{np_i(1-p_i)}$, where $p_i$ is the precision measured in that fold. Here I'm assuming a Binomial distribution with $n$ trials and probability of "success" (correct prediction) equal to the precision $p_i$. I then take the average of all these $\sigma_i$ as an estimate of the "population standard deviation" and divide it by $\sqrt{n}$, i.e. if the number of folds is $k$, then $\bar\sigma=\sum_{j=1}^k\sigma_j/(k\sqrt{n})$.
4. Using either of the methods in 2 or 3 to calculate $\bar\sigma$, we estimate the population precision as $\hat p\pm 1.96\bar\sigma$.
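
To make the two ways of getting $\bar\sigma$ concrete, here is a minimal Python sketch that follows the description above literally. The fold precisions, the fold size `n`, the value of `p_hat` and all function names are made up for illustration; nothing here is claimed to be the correct procedure.

```python
import math
import statistics

def fold_sd_empirical(fold_precisions):
    """Option 2: sample standard deviation of the per-fold precision values."""
    return statistics.stdev(fold_precisions)

def fold_sd_binomial(fold_precisions, n):
    """Option 3 as written: mean of sqrt(n * p_i * (1 - p_i)), divided by sqrt(n)."""
    k = len(fold_precisions)
    sigmas = [math.sqrt(n * p * (1 - p)) for p in fold_precisions]
    return sum(sigmas) / (k * math.sqrt(n))

def interval(p_hat, sigma_bar, z=1.96):
    """Step 4: p_hat +/- z * sigma_bar."""
    return p_hat - z * sigma_bar, p_hat + z * sigma_bar

# Hypothetical inputs, chosen only to exercise the functions.
fold_precisions = [0.78, 0.82, 0.80, 0.84, 0.76]
n = 200        # assumed number of examples per fold
p_hat = 0.80   # assumed test-set precision

print(interval(p_hat, fold_sd_empirical(fold_precisions)))
print(interval(p_hat, fold_sd_binomial(fold_precisions, n)))
```

One thing the sketch makes visible: in option 3 the $\sqrt{n}$ factors cancel, so $\bar\sigma$ as written is just the mean of $\sqrt{p_i(1-p_i)}$ and does not shrink as $n$ grows. The two options are therefore not on the same scale.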

Or I could just calculate the interval as (assuming test set size $m$) $$\hat p\pm t_{m,95\%}\cdot\sqrt{\frac{\hat p(1-\hat p)}{m}}$$ where $t_{m,95\%}$ is the t-distribution critical value corresponding to a 95% confidence level and sample size $m$.
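
A small Python sketch of this direct calculation, using only the standard library. One deliberate swap: it uses the standard normal quantile instead of the t quantile, on the assumption that $m$ is large enough for the two to be nearly identical. The function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_interval(p_hat, m, confidence=0.95):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / m)."""
    # Two-sided critical value, e.g. about 1.96 for confidence=0.95.
    # Assumption: normal quantile used in place of the t quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / m)
    return p_hat - half_width, p_hat + half_width

# Hypothetical example: observed proportion 0.8 on a test set of 400 examples.
print(proportion_interval(0.8, 400))
```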

But what about other metrics like precision-recall combo, mean absolute percentage error, mean absolute error, RMSE, etc. etc.? Obviously I'm not expecting a recipe for each metric, but just a general idea on how we go about getting interval estimates for arbitrary metrics. Also, does the methodology described above seem correct?

Is your question targeted at RMSE, MAE etc. because their distribution is unknown? Is the question about how to build confidence intervals on random variables with unknown distributions and small sample sizes? — Jayaram Iyer, May 19 '21 at 16:56
Maybe you should take a look at Markov's and Chebyshev's inequalities. Excellent question though. — Jayaram Iyer, May 19 '21 at 16:56
@JayaramIyer: Thanks! And the motivation is like this: usually in data science projects we quote point estimates of metrics, i.e. we calculate the metric for the test set and use it as a point estimate for the population. But what if I want to go one step further and give an interval estimate at, say, the 95% confidence level? I can treat the value of the metric as a random variable, as you said, investigate its sampling distribution and make an inference about it at the population level. — Shirish Kulhari, May 19 '21 at 17:11
@JayaramIyer: Maybe for RMSE I can instead consider the squared error as my random variable. For the $i$-th example, $X_i$ is the squared error. Its population mean would be the "true" MSE, while the "sample mean" would be our test set MSE. Maybe we can use the CLT to get the interval estimate? But then I'm not sure if the random variables in question can be considered iid. Independent - why not, but identical? I think the model would have more trouble correctly classifying/predicting certain sections of the population, so the identical part is a challenge. Maybe I'm overthinking — Shirish Kulhari, May 19 '21 at 17:14
I think not all metrics (like MAE) can be handled this way; in fact there is no known method to generalise them in a statistically significant way. In any case I think this question is more suitable for https://stats.stackexchange.com/questions — Nikos M., May 19 '21 at 18:52
For linear regression, isn't the assumption that the residuals are normally distributed? In that case wouldn't it be easy to come up with a 95% confidence interval? — Jayaram Iyer, May 20 '21 at 08:16
@JayaramIyer True, but I meant usage of RMSE (and other metrics) in arbitrary models. I'm starting to think that there's no model-independent way of getting confidence intervals, but I'll need to do more digging — Shirish Kulhari, May 20 '21 at 09:08
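
The idea raised in the comments, treating each example's squared error as the random variable and applying a CLT-style interval to its mean, could be sketched in Python roughly as follows. It leans on exactly the iid assumption that the comment questions, so treat it as an illustration of the mechanics only. The function name and the toy data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mse_interval(y_true, y_pred, confidence=0.95):
    """Normal-approximation interval for the mean squared error.

    Assumes the per-example squared errors are iid draws from the population.
    """
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    m = len(sq_err)
    mse = statistics.fmean(sq_err)
    # Standard error of the mean of the squared errors.
    se = statistics.stdev(sq_err) / math.sqrt(m)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mse, (mse - z * se, mse + z * se)

# Toy data, far too small for the approximation to be trustworthy.
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 3.0, 3.0]
mse, (lo, hi) = mse_interval(y_true, y_pred)
print(mse, lo, hi)
# A rough RMSE interval follows by taking square roots of the endpoints,
# clipping the lower endpoint at zero first.
print(math.sqrt(max(lo, 0.0)), math.sqrt(hi))
```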
