
I have read a few blog posts and books that caution against data leakage when the training data is adjusted using any information from the held-out datasets. For example, if I standardize the variables using a mean and variance calculated from the combined training, validation, and test sets, then some information from the validation and test sets leaks into that mean/variance.

However, this does not seem to be a problem to me. Consider this general argument for standardizing variables: for any set of numbers a and any set of numbers b, if we apply them to shift and rescale the data, the model should not do any better, because the transformation only changes the location and scale of the distribution while the internal relations among the data points stay the same. A set of means and a set of standard deviations is just one particular choice of such a and b, so using them should not affect the modeling process. Is there a better argument for why it matters than "the mean and variance are based on unseen data"?

For intuition on why this is the case, consider a cloud of data with two features x1 and x2 and a binary label y. If you fit a simple logit or a decision tree, for example, it will recommend splits along some dimension of x1 and x2. If you shift the locations by some set of numbers a (a1 for x1, a2 for x2), the recommended splits just shift by the same amounts. If you rescale the data by dividing by some set of numbers b (b1 for x1, b2 for x2), the recommended splits just rescale accordingly. The model adjusts itself because the internal structure of the data is unchanged. It does not seem unreasonable to extend this to more features and to continuous or multiclass targets.
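For concreteness, here is a minimal sketch (not from the original post) of the shift/rescale argument, using scikit-learn and a synthetic two-feature dataset; the shift vector a, scale vector b, and tree depth are arbitrary choices for illustration. A tree fit on (x - a) / b should score the same on held-out data as a tree fit on the raw features, because every candidate split point is transformed along with the data.

```python
# Sketch: a decision tree's held-out accuracy is unchanged by per-feature
# shifting and rescaling, since the split thresholds shift/rescale with the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

a = np.array([5.0, -3.0])   # arbitrary shifts, one per feature
b = np.array([0.1, 40.0])   # arbitrary scales, one per feature

def fit_and_score(Xtr, Xte):
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xtr, y_tr)
    return accuracy_score(y_te, tree.predict(Xte))

print(fit_and_score(X_tr, X_te))                      # raw features
print(fit_and_score((X_tr - a) / b, (X_te - a) / b))  # shifted/rescaled features
```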

I also tested this empirically on one example dataset (binary classification, 40k observations, 100 features to be cleaned and processed). Across several runs with a variety of penalized logits, the model fit with data leaked through the mean and standard deviation performed about the same on average as the model fit without leakage (where only the training data was used to calculate the mean and variance).
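An experiment of this kind might be reconstructed along the following lines (a hypothetical sketch, not my actual code or data): fit a StandardScaler on the training split only versus on all the data, then compare the held-out AUC of an L2-penalized logistic regression. The dataset here is synthetic, and the single train/test split stands in for the train/valid/test setup described above.

```python
# Sketch: compare standardization statistics computed from the training split
# only ("no leakage") vs. from all splits combined ("leaked"), then fit an
# L2-penalized logistic regression and score on the held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=40_000, n_features=100, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def auc_with_scaler(fit_on):
    scaler = StandardScaler().fit(fit_on)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(scaler.transform(X_tr), y_tr)
    return roc_auc_score(y_te, clf.predict_proba(scaler.transform(X_te))[:, 1])

print("train-only stats:", auc_with_scaler(X_tr))                      # no leakage
print("all-data stats:  ", auc_with_scaler(np.vstack([X_tr, X_te])))   # leaked
```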

I am curious: does anyone have a dataset, model, or metric where data leakage of this type was shown to consistently improve performance on unseen data? Even one of the books that warned about this showed the leaked-data model performing slightly worse on some metric, seemingly by coincidence.

dzheng1887
  • Hey, someone has shown me this link https://datascience.stackexchange.com/questions/88924/… I agree that if the mean and standard deviation are practically the same, because there is sufficient data and stationarity between the training/validation datasets, then the estimates are close enough that it doesn't really matter. In my case, I have found that the mean and standard deviation are not exactly the same between the two datasets. I am testing whether the model fit with data leakage is consistently better, even if "better" means only at the 10th decimal place in some metric. – dzheng1887 Nov 13 '22 at 00:39
  • See also https://stats.stackexchange.com/q/588185/232706. – Ben Reiniger Nov 13 '22 at 02:06
  • Thank you for sharing, I essentially have the same question. I am also willing to share my Python code/data to reproduce the result I mentioned. Please let me know, and I welcome anyone to adjust it to show a case where using validation data in the summary stats helps a model fit on unseen data. Also, I adjusted the question to be a bit clearer on the argument for why the shifting/rescaling doesn't seem like it should matter. – dzheng1887 Nov 14 '22 at 00:49

0 Answers