To prevent data leakage from the validation set into the training set in step (2),
- We should first split the data into training and validation sets, then
- Calculate the mean and standard deviation using only the training set, and finally
- Use this mean and std to normalize both the training and validation sets.
This way, no information leaks from the validation set into the training set, although the validation set will be only approximately normalized. This deviation reflects an informative difference between the distributions of the validation and training sets, which later [rightfully] leads to a validation error larger than the training error.
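A minimal sketch of this procedure, assuming a 1D NumPy array of prices and a chronological split (the numbers and variable names are illustrative, not from the original question):

```python
import numpy as np

# Hypothetical 1D price series (illustrative numbers only).
prices = np.array([0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 2.0, 2.1, 1.9])

# (1) Split first: earlier points for training, later ones for validation.
split = int(0.7 * len(prices))
train, val = prices[:split], prices[split:]

# (2) Compute the statistics on the training set only.
mean, std = train.mean(), train.std()

# (3) Normalize BOTH sets with the training statistics.
train_norm = (train - mean) / std
val_norm = (val - mean) / std

# The validation set ends up only approximately normalized; its offset
# reflects the real distribution shift between the two sets.
print(train_norm.mean(), val_norm.mean())
```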
We might think of normalizing the two sets separately; however, this way we would mask a potential discrepancy between the distributions of the validation and training sets.
For example, in a 1D time series of prices, if the average price in the training set is \$1.0, we expect a validation set with an average of \$2.0 to cause problems for a model trained on prices around \$1.0. However, by normalizing the validation set separately, we wrongly bring it into the range the trained model already expects, which leads to underestimating the validation error.
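To illustrate, reusing the hypothetical `train` and `val` arrays from the sketch above (training prices around \$1.0, validation prices around \$2.0):

```python
# Normalizing the validation set with ITS OWN statistics (the wrong way):
val_norm_wrong = (val - val.mean()) / val.std()
print(val_norm_wrong.mean())  # ~0.0, the shift to the $2.0 range is hidden

# Normalizing with the TRAINING statistics (the right way) preserves the
# shift, so the validation error honestly reports the mismatch.
val_norm_right = (val - train.mean()) / train.std()
print(val_norm_right.mean())  # far from 0
```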
Edit, in response to a comment by Alon Gouldman: "...in the training data itself - wouldn't every example be biased by the future?"
Correct. If the training set is $[P_{t_1}, ..., P_{t_N}]$, then for $n \in [1, N)$ the normalized point $\hat{P}_{t_n}$ is biased toward (contains information about) the points $[P_{t_{n+1}}, ..., P_{t_N}]$. This, however, is not leakage, since the entire training set is what we know, so we can do whatever we want with it. From another point of view, this seemingly cheating transformation turns out to be acceptable because (1) any shortcoming of the normalization will be revealed in the validation error (the more we cheat, the larger the validation error), so we will not be misled, and (2) the normalization can benefit us by lowering the validation error (otherwise, we would not normalize).
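A small, self-contained demonstration of this dependence, again with illustrative numbers: changing a later training point changes the normalized value of an earlier one, yet only training-set information is involved, so nothing about the validation set leaks into training.

```python
import numpy as np

train_a = np.array([1.0, 1.1, 0.9, 1.0, 1.2])
train_b = train_a.copy()
train_b[-1] = 5.0  # change only the LAST (future) training point

# The normalized value of the FIRST point differs between the two cases,
# i.e. it contains information about later training points...
print((train_a[0] - train_a.mean()) / train_a.std())
print((train_b[0] - train_b.mean()) / train_b.std())

# ...but every value used here comes from the training set alone.
```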