
I've read somewhere (I forget the source) that "in cross-validation, the model with the best score at the 95% confidence interval is picked".
But from my statistics knowledge, for a confidence interval (CI) to be valid you need a normality assumption about the sampling distribution of the statistic.
Yet that source seems to simply use the results from each fold to construct the sample mean and the confidence interval, without checking whether the central limit theorem applies at all. It seems to me this is what people generally do:
i) automatically assume normality of the sample means (rather than examining the sampling distribution), and ii) treat the CLT as automatically satisfied.
May I know whether this is my misunderstanding, or whether the industry has adopted a norm that is too loose? Thanks.
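For concreteness, here is a minimal sketch of the practice I am describing (my own illustration, not taken from that source; the dataset and model are just placeholders): take the k fold scores and build a t-based 95% CI around their mean without checking any distributional assumptions.

```python
# Minimal sketch: fold scores from cross-validation, then a t-based 95% CI
# around their mean, with normality simply assumed rather than checked.
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)

mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean of the fold scores
# t-interval with k-1 degrees of freedom; normality of the mean is assumed
lo, hi = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"mean accuracy = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```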

Wong
    It's probably the second option; imho it's common in ML applications to play fast and loose with statistical principles for the sake of efficiency (or laziness). That being said, I don't think it's common to even mention confidence intervals with CV. – Erwan Jun 25 '20 at 12:08
  • @Erwan I agree. thanks. – Wong Jun 26 '20 at 02:48

1 Answer


It depends on how the confidence interval (CI) is generated. The most common method is based on the sample mean, with the assumption that the samples are drawn from a normal distribution. However, a CI can be generated from any statistic on observed data. An alternative method is bootstrapping (resampling the statistic), which does not require the normality assumption.
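For example, here is a rough sketch (with made-up fold scores, purely for illustration, not a definitive recipe) contrasting a t-based CI with a percentile-bootstrap CI over the same cross-validation scores:

```python
# Contrast a t-based CI (normality assumed) with a percentile-bootstrap CI
# (no normality assumption) over the same set of fold scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical accuracies from 10 cross-validation folds
scores = np.array([0.91, 0.88, 0.95, 0.90, 0.93, 0.89, 0.92, 0.94, 0.87, 0.96])

# 1) t-based CI: assumes the mean of the fold scores is ~normally distributed
mean = scores.mean()
lo_t, hi_t = stats.t.interval(0.95, df=len(scores) - 1,
                              loc=mean, scale=stats.sem(scores))

# 2) Percentile-bootstrap CI: resample the fold scores with replacement and
#    take the 2.5th and 97.5th percentiles of the resampled means
boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
              for _ in range(10_000)]
lo_b, hi_b = np.percentile(boot_means, [2.5, 97.5])

print(f"t-based 95% CI:   ({lo_t:.3f}, {hi_t:.3f})")
print(f"bootstrap 95% CI: ({lo_b:.3f}, {hi_b:.3f})")
```

The bootstrap interval relies only on resampling the observed scores rather than on a normality assumption, although with only a handful of folds either interval should be interpreted cautiously.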

Brian Spiering