I have read a few blog posts and some books that caution against data leakage, i.e. adjusting the training data using any information from the held-out datasets. For example, if I were to standardize the variables using a mean and variance calculated from the combined training, validation, and test sets, the mean/variance would carry some information that comes from the validation and test sets.
However, this does not seem to be a problem to me. Consider this general argument about standardizing variables: for any set of numbers a and any set of numbers b, if we use them to rescale the data through standardization, the model should not do any better, because the transformation only changes the location and scale of the distribution of the data while the internal relations among the datapoints stay the same. A set of means and a set of standard deviations is just one particular choice of those numbers a and b, so applying them should not affect the modeling process. Is there a better argument for why this leakage matters than "the mean and variance are based on unseen data"?
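To make that argument concrete, here is a minimal sketch of what I mean, assuming scikit-learn and a made-up synthetic dataset (the constants a and b and all names below are illustrative, not my actual data). An effectively unpenalized logit should give the same predictions whether or not the features are rescaled by arbitrary a and b:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Arbitrary shifts a and scales b per feature -- any constants will do.
a = np.array([3.0, -1.0, 0.5, 10.0, 0.0])
b = np.array([2.0, 0.1, 5.0, 1.0, 7.0])

# Effectively unpenalized logit (very large C), so the fit is affine-equivariant.
m1 = LogisticRegression(C=1e10, max_iter=5000).fit(X_tr, y_tr)
m2 = LogisticRegression(C=1e10, max_iter=5000).fit((X_tr - a) / b, y_tr)

p1 = m1.predict_proba(X_te)[:, 1]
p2 = m2.predict_proba((X_te - a) / b)[:, 1]
print(np.max(np.abs(p1 - p2)))  # ~0 up to optimizer tolerance: same predictions
```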
For the intuition behind why this is the case, consider a cloud of data with two features x1 and x2 and a binary label y. If you fit a simple logit or a decision tree, for example, it will recommend splits of the data along some dimension of x1 or x2. If you shift the locations by some numbers a (a1 for x1, a2 for x2), the recommended splits just shift accordingly. If you rescale the data by dividing by some numbers b (b1 for x1, b2 for x2), the recommended splits just rescale accordingly. The model simply adjusts itself, because the internal structure of the data is still the same. It does not seem unreasonable to extend this to more features and to continuous or multiclass targets.
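The split-shifting intuition can be checked directly with a decision tree. This is again a sketch on synthetic data with scikit-learn (the shifts and scales are arbitrary, hypothetical choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
a = np.array([5.0, -2.0])   # shifts a1, a2
b = np.array([3.0, 0.25])   # scales b1, b2

t1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit((X - a) / b, y)

# Predictions should agree everywhere (up to floating-point ties);
# only the split thresholds move, by the same (t - a_j) / b_j mapping.
print(np.array_equal(t1.predict(X), t2.predict((X - a) / b)))
print(t1.tree_.threshold[t1.tree_.feature >= 0][:3])
print(t2.tree_.threshold[t2.tree_.feature >= 0][:3])
```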
I also tested this empirically on one example dataset (binary classification, 40k observations, 100 features to be cleaned and processed). The same model (I ran this several times with a variety of penalized logits) fit with data "leaked" through the mean and standard deviation performed about the same on average as a model fit without leakage (only the training data was used to calculate the mean and variance).
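The comparison I ran looked roughly like the sketch below, with a synthetic stand-in for the 40k x 100 dataset (this is not my actual data; scaler and model settings are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def run(seed):
    X, y = make_classification(n_samples=40_000, n_features=100,
                               n_informative=20, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=seed)

    # "Leaked": scaler fit on train + test combined.
    leak = StandardScaler().fit(np.vstack([X_tr, X_te]))
    # "Clean": scaler fit on the training split only.
    clean = StandardScaler().fit(X_tr)

    model = LogisticRegression(penalty="l2", C=1.0, max_iter=2000)
    auc_leak = roc_auc_score(
        y_te, model.fit(leak.transform(X_tr), y_tr)
                   .predict_proba(leak.transform(X_te))[:, 1])
    auc_clean = roc_auc_score(
        y_te, model.fit(clean.transform(X_tr), y_tr)
                   .predict_proba(clean.transform(X_te))[:, 1])
    return auc_leak, auc_clean

results = np.array([run(s) for s in range(5)])
print(results.mean(axis=0))  # the two average AUCs come out about the same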
I am curious: does anyone have a dataset, model, or metric where it has been shown that data leakage of this type consistently helps performance on unseen data? Even one of the books that warned about this showed the leaked-data model performing slightly worse on some metric, by a "coincidence".