I have standardized my training set using:

```python
mean = XTrain.mean()
XTrain -= mean

std = XTrain.std()
XTrain /= std
```

I then used the same `mean` and `std` to standardize the validation and test sets. The training and validation sets contain values that are greater than 1 and values that are less than 0. Is that okay?

skrrrt

2 Answers

Standardization rescales the values so that they have a mean of $0$ and a standard deviation of $1$. Values smaller than $0$ or greater than $1$ are therefore to be expected. If you want the training values to lie between $0$ and $1$, you need to normalize the data (min-max scaling) instead.

Here is an example of the two procedures taken from the book "Python Machine Learning" by Raschka:

[Image: example of standardization and normalization from "Python Machine Learning"]

Be sure, though, to transform your test data using parameters obtained from the training data (for standardization, the mean and standard deviation of the training data).
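As a rough illustration of both procedures, here is a minimal NumPy sketch. The data is synthetic and the variable names (`X_train`, `X_test`, and so on) are made up for this example; it is not the code from the book.

```python
import numpy as np

# Synthetic 1-D data, purely illustrative
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=1000)
X_test = rng.normal(loc=5.0, scale=2.0, size=200)

# Standardization: mean and std come from the training set only
mean, std = X_train.mean(), X_train.std()
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

# Min-max normalization: min and max also come from the training set only
lo, hi = X_train.min(), X_train.max()
X_train_mm = (X_train - lo) / (hi - lo)
X_test_mm = (X_test - lo) / (hi - lo)

print(X_train_std.mean(), X_train_std.std())  # approximately 0 and 1
print(X_train_std.min(), X_train_std.max())   # below 0 and above 1
print(X_train_mm.min(), X_train_mm.max())     # 0.0 and 1.0
```

Note that `X_test_mm` can still fall slightly outside $[0, 1]$ if the test set contains values beyond the training minimum or maximum.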

scikit-learn provides classes for standardization and normalization (`StandardScaler` and `MinMaxScaler` in `sklearn.preprocessing`) which you might want to have a look at.
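A minimal sketch of the scikit-learn route, assuming fixed-size 2-D arrays (samples by features). The data and variable names are invented for illustration; the scalers are fit on the training data only and then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data with 3 features, purely illustrative
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit on the training data only, then transform both sets
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_test_std = std_scaler.transform(X_test)

mm_scaler = MinMaxScaler().fit(X_train)
X_train_mm = mm_scaler.transform(X_train)
X_test_mm = mm_scaler.transform(X_test)

# The learned per-feature training mean is stored on the fitted scaler
print(std_scaler.mean_)
```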

Jonathan
  • For some reason my model performed worse (AUC) when normalized instead of standardized. Is this normal? Also, I am performing these operations myself because I have variable-length arrays, so `sklearn`'s classes don't work there. – skrrrt Jul 20 '20 at 17:57
  • @skrrrt Performance can differ between normalization and standardization. It depends on your data and type of model. See, for example, [this question](https://datascience.stackexchange.com/q/43972/84891). – Jonathan Jul 20 '20 at 18:03

A standardized value measures how many standard deviations the original value lies from the mean. Values can certainly be many standard deviations from the mean. Even for normally distributed data, we expect about $5\%$ of the observations to be more than $2$ standard deviations from the mean, and about $32\%$ to be more than $1$ standard deviation from the mean.

Therefore, it is not at all concerning that you have values greater than $1$.
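The tail fractions quoted above for a normal distribution can be checked with Python's standard library. This is only a sketch; the helper name `fraction_beyond` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def fraction_beyond(k: float) -> float:
    """Probability that a normal value lies more than k standard deviations from the mean."""
    # Two-sided tail: both below -k and above +k
    return 2.0 * (1.0 - standard_normal.cdf(k))

print(round(fraction_beyond(1), 4))  # 0.3173
print(round(fraction_beyond(2), 4))  # 0.0455
```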

As for values less than $0$, all that means is that the value is below the mean. This is common. (A data set can have no values below the mean, but only when every value is identical.)

As Sammy mentioned mere seconds before I posted, be sure to use the mean and standard deviation from your training data when you transform the test and validation data.

Dave