I use stratified sampling to split my data into train/validation/test sets. Is it appropriate to define the strata using information extracted from the dataset itself? For example, could I use a machine-learning model to produce a proxy variable and then define the strata from it?

My worry is potential data leakage.

I haven't been able to find any counter-argument, though.
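
To make the setup concrete, here is a minimal sketch of the kind of split being asked about. It assumes scikit-learn, synthetic data from `make_classification`, k-means cluster labels as the proxy variable, and a 60/20/20 split; all of these are illustrative choices, not a specific pipeline from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Proxy variable: cluster labels fitted on the features of ALL rows.
# This is the step in question, because it sees every row before the split.
strata = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Split row indices 60/40, stratified on the proxy variable.
idx = np.arange(len(X))
train_idx, rest_idx = train_test_split(
    idx, test_size=0.4, stratify=strata, random_state=0
)

# Split the remaining 40% in half (validation/test), again stratified.
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5, stratify=strata[rest_idx], random_state=0
)
```

In this sketch the proxy model only decides which rows land in which subset; it uses the features `X` but not the target `y`, and its output is not passed to any downstream model as a feature.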

holoubekm

0 Answers