Does the test set have to be in the [0,1] range?

Question

I have standardized my training set using:

mean = XTrain.mean()
XTrain -= mean

std = XTrain.std()
XTrain /= std

I then used the same mean and std to standardize the validation and test sets. The training and validation sets contain values greater than 1 and less than 0. Is that okay?

I assume this is just a typo, but in line 5 you need to divide by the standard deviation instead of the mean again. – Jonathan, Jul 20 '20 at 17:33

score 2 · Accepted Answer · answered Jul 20 '20 at 17:31

Standardization centers the values around a mean of $0$ with a standard deviation of $1$. Therefore, having values smaller than $0$ or greater than $1$ is to be expected. If you want to make sure the values are between $0$ and $1$, you need to normalize the data (min-max scaling) instead.
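As a minimal sketch of the difference (NumPy is assumed, and the array values are made up for illustration), standardization produces values on both sides of zero, while min-max normalization maps the data onto $[0, 1]$:

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# Standardization: subtract the mean, then divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

# Min-max normalization: rescale so the minimum maps to 0 and the maximum to 1.
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_std.min(), x_std.max())    # negative minimum, maximum above 1
print(x_norm.min(), x_norm.max())  # 0.0 1.0
```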

Here is an example of the two procedures taken from the book "Python Machine Learning" by Raschka (figure not reproduced here).

Be sure, though, to apply the procedure to your test data using parameters obtained from the training data (for standardization: the mean and standard deviation of the training data).
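A minimal sketch of what this looks like, assuming NumPy and small made-up arrays (`X_train` and `X_test` are illustrative names, not taken from the question). Because the parameters come from the training data, even min-max normalized test values can land outside $[0, 1]$ when the test set contains values beyond the training range:

```python
import numpy as np

# Hypothetical data: the test set contains values outside the training range.
X_train = np.array([2.0, 4.0, 6.0, 8.0])
X_test = np.array([1.0, 5.0, 9.0])

# Standardization parameters are computed on the training set only.
mean, std = X_train.mean(), X_train.std()
X_test_std = (X_test - mean) / std

# Min-max parameters are also computed on the training set only.
train_min, train_max = X_train.min(), X_train.max()
X_test_norm = (X_test - train_min) / (train_max - train_min)

print(X_test_std)   # values on both sides of 0
print(X_test_norm)  # first value below 0, last value above 1
```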

Sklearn has methods for standardization and normalization which you might want to have a look at.
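For reference, a short sketch using scikit-learn's `StandardScaler` and `MinMaxScaler` (the data is made up for illustration, and both scalers expect 2-D input of shape samples by features). Each scaler is fitted on the training data only and then reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data with a single feature column.
X_train = np.array([[2.0], [4.0], [6.0], [8.0]])
X_test = np.array([[1.0], [5.0], [9.0]])

# Standardization: fit on the training data, transform both sets.
standardizer = StandardScaler().fit(X_train)
X_train_std = standardizer.transform(X_train)
X_test_std = standardizer.transform(X_test)

# Min-max normalization: same pattern with a different scaler.
normalizer = MinMaxScaler().fit(X_train)
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)
```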

answered Jul 20 '20 at 17:31

Jonathan

For some reason my model performed worse (in terms of AUC) when the data was normalized rather than standardized. Is this normal? Also, I am performing these operations myself because I have variable-length arrays, so `sklearn`'s classes don't work there. – skrrrt Jul 20 '20 at 17:57

@skrrrt performance can differ between normalization and standardization. It depends on your data and the type of model. See, for example, [this question](https://datascience.stackexchange.com/q/43972/84891). – Jonathan Jul 20 '20 at 18:03

score 2 · Answer 2 · answered Jul 20 '20 at 17:32

With standardized data, you're measuring how many standard deviations from the mean a given value is. Values can certainly be many standard deviations from the mean. Even for data with a normal distribution, we expect about $5\%$ of the observations to be more than $2$ standard deviations from the mean, and about $32\%$ to be more than $1$ standard deviation from the mean.

Therefore, it is not at all concerning that you have values greater than $1$.
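The tail percentages quoted above can be checked with Python's standard library. This is a small sketch that assumes exactly normal data, which real data will only approximate:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two-sided tail probability: P(|Z| > k) = 2 * (1 - cdf(k))
beyond_1_sd = 2 * (1 - z.cdf(1))
beyond_2_sd = 2 * (1 - z.cdf(2))

print(f"{beyond_1_sd:.1%}")  # 31.7%
print(f"{beyond_2_sd:.1%}")  # 4.6%
```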

As far as values less than $0$ go, all that means is that you have a value less than the mean. This is common. (A data set with no values below the mean is possible, but only if every value is identical.)

As Sammy mentioned mere seconds before I posted, be sure to use the mean and standard deviation from your training data when you transform the test and validation data.
