To prevent data leakage from the validation set into the training set in step (2),
- We should first split the data into training and validation sets, then
- Calculate the mean and standard deviation using only the training set, and finally
- Use this mean and std to normalize both the training and validation sets.
This way, no information leaks from the validation set into the training set, although the validation set will be only approximately normalized. This deviation reflects an informative difference between the distributions of the validation and training sets, which later [rightfully] leads to a validation error larger than the training error.
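A minimal sketch of this procedure, assuming a 1D NumPy array of prices and a chronological split (the numbers and variable names are illustrative, not from the original question):

```python
import numpy as np

# Hypothetical 1D price series (illustrative numbers only).
prices = np.array([0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 2.0, 2.1, 1.9])

# (1) Split first: earlier points for training, later ones for validation.
split = int(0.7 * len(prices))
train, val = prices[:split], prices[split:]

# (2) Compute the statistics on the training set only.
mean, std = train.mean(), train.std()

# (3) Normalize BOTH sets with the training statistics.
train_norm = (train - mean) / std
val_norm = (val - mean) / std

# The validation set ends up only approximately normalized; its offset
# reflects the real distribution shift between the two sets.
print(train_norm.mean(), val_norm.mean())
```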
We might think of normalizing the two sets separately; however, this way we would mask a potential discrepancy between the distributions of the validation and training sets.
For example, in a 1D time series of prices, if the average price in the training set is \$1.0, we expect a validation set with an average of \$2.0 to cause problems for a model trained on prices around \$1.0. However, by normalizing the validation set separately, we wrongly bring it into the range the trained model already expects, which leads to underestimating the validation error.
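To illustrate, reusing the hypothetical `train` and `val` arrays from the sketch above (training prices around \$1.0, validation prices around \$2.0):

```python
# Normalizing the validation set with ITS OWN statistics (the wrong way):
val_norm_wrong = (val - val.mean()) / val.std()
print(val_norm_wrong.mean())  # ~0.0, the shift to the $2.0 range is hidden

# Normalizing with the TRAINING statistics (the right way) preserves the
# shift, so the validation error honestly reports the mismatch.
val_norm_right = (val - train.mean()) / train.std()
print(val_norm_right.mean())  # far from 0
```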
Edit, in response to a comment by Alon Gouldman: "...in the training data itself - wouldn't every example be biased by the future?"
Correct. If the training set is $[P_{t_1}, ..., P_{t_N}]$, then for $n \in [1, N)$ the normalized point $\hat{P}_{t_n}$ is biased toward (contains information about) the points $[P_{t_{n+1}}, ..., P_{t_N}]$. This, however, is not leakage, since the entire training set is what we know, so we can do whatever we want with it. From another point of view, this seemingly cheating transformation turns out to be acceptable because (1) any shortcoming of the normalization will be revealed in the validation error (the more we cheat, the larger the validation error), so we will not be misled, and (2) the normalization can benefit us by lowering the validation error (otherwise, we would not normalize).
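A small, self-contained demonstration of this dependence, again with illustrative numbers: changing a later training point changes the normalized value of an earlier one, yet only training-set information is involved, so nothing about the validation set leaks into training.

```python
import numpy as np

train_a = np.array([1.0, 1.1, 0.9, 1.0, 1.2])
train_b = train_a.copy()
train_b[-1] = 5.0  # change only the LAST (future) training point

# The normalized value of the FIRST point differs between the two cases,
# i.e. it contains information about later training points...
print((train_a[0] - train_a.mean()) / train_a.std())
print((train_b[0] - train_b.mean()) / train_b.std())

# ...but every value used here comes from the training set alone.
```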