
Currently I'm doing some simple feature selection based on the correlation between features and the variance within each feature. I apply this to the whole dataset used for model building, before creating the cross-validation splits.

My question is whether this is an acceptable workflow, or whether it can significantly bias the CV statistics and make the model look better than it actually is.

Is it technically better to do the CV split first and only then select features on the training set of each split, so that no information leaks?

beginner_

1 Answer


If you do not use the target for feature selection, then no information about the target leaks and you can apply the selection to the whole dataset. The methods you mention, checking the correlation between features and checking the variance of each feature, belong to this category.
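To make the "no target involved" case concrete, here is a minimal sketch of such a filter using NumPy. The function name `unsupervised_filter` and both thresholds are made up for illustration, not taken from the question; the point is only that `y` never appears in the computation.

```python
import numpy as np

def unsupervised_filter(X, min_variance=1e-8, max_abs_corr=0.95):
    """Return indices of columns kept by a variance and a correlation filter.

    Hypothetical helper: only X is inspected, the target is never used.
    """
    X = np.asarray(X, dtype=float)
    # Step 1: drop (near-)constant columns.
    candidates = np.flatnonzero(X.var(axis=0) > min_variance)
    if len(candidates) < 2:
        return candidates
    # Step 2: greedily keep a column only if it is not too strongly
    # correlated with any column that has already been kept.
    corr = np.abs(np.corrcoef(X[:, candidates], rowvar=False))
    kept = []
    for j in range(len(candidates)):
        if all(corr[j, k] <= max_abs_corr for k in kept):
            kept.append(j)
    return candidates[kept]

# Toy data: column 1 is almost a copy of column 0, column 2 is constant.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 1e-3 * rng.normal(size=200), np.ones(200), b])
kept_columns = unsupervised_filter(X)
print(kept_columns)  # columns 0 and 3 survive
```

The greedy order (first column wins among a correlated pair) is an arbitrary choice here; other tie-breaking rules would work equally well.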

If you do use the target for feature selection, then you should use the training subset only; otherwise there is leakage. Examples would be (i) checking the correlation between each feature and the target, or (ii) building a Random Forest model and keeping the most important features.
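A rough way to see this effect is to run target-based selection on pure noise. The sketch below assumes scikit-learn is available; the data sizes, `k=20`, and the choice of `SelectKBest` with `f_classif` and logistic regression are arbitrary illustrations, not something from the question. Since the labels are random, an honest CV accuracy should sit near chance level.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # pure noise features
y = rng.integers(0, 2, size=100)   # random labels, unrelated to X

# Leaky: the selector sees the labels of every row, including rows
# that later end up in the validation folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Sound: selection is a pipeline step, so it is refit on the
# training part of each fold only.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection before CV: {leaky_score:.2f}")
print(f"selection inside CV: {honest_score:.2f}")
```

With many noise features and few samples, the first score typically comes out well above chance even though there is nothing to learn, while the second stays close to 0.5. Putting the selector inside the pipeline is one convenient way to get the "training subset only" behaviour for every fold.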

lanenok