
I would like to ask a question about dataset splitting. If I have a dataset, can I perform preprocessing (imputation, scaling, etc.) on the entire dataset and then split it into train and validation sets? If I don't have a test set, I often create new inputs myself to test the model.

Or should I split the dataset first and then perform preprocessing on the training set, applying the fitted scaler and imputer to the validation set?

2 Answers


In principle, you can do many preprocessing activities (e.g., converting data types, removing NaN values) on the entire dataset, since for these steps it makes no difference whether they are applied before or after the split into training and test sets.

However, when using a StandardScaler, for instance, you should fit the scaler on the training data (usually including the validation set) and transform both the training and test data with this fitted scaler. This prevents information from the unseen test set from spilling over into the training process. For further discussion of fitting and transforming a StandardScaler, see: StandardScaler before or after splitting data - which is better?
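As a minimal sketch of this split-then-fit order (assuming a pandas DataFrame df with a placeholder target column "target", neither of which appears in the original question):

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X = df.drop(columns=["target"])  # df and "target" are placeholders
    y = df["target"]

    # Split first, so the test set stays unseen while fitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit only on the training data
    X_test_scaled = scaler.transform(X_test)        # reuse the training statistics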

The same is true for removing outliers or imputing missing values (e.g., by the mean of the respective column). In this case, you should compute the respective statistic on the training data only, and therefore split the dataset before imputation.
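The same pattern with an imputer, reusing the X_train/X_test split from the sketch above (scikit-learn's SimpleImputer is used here only as an illustration):

    from sklearn.impute import SimpleImputer

    # X_train / X_test come from a split like the one above.
    imputer = SimpleImputer(strategy="mean")
    X_train_imp = imputer.fit_transform(X_train)  # column means computed on training data
    X_test_imp = imputer.transform(X_test)        # training means applied to the test data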

Usually, the validation set is treated as part of the training set (with k-fold cross-validation, there might not even be a fixed validation set), while the test set is separated as early as possible.

Hope this gives you a bit of guidance.

justinlk
  • Many thanks for your reply. So, is my approach right? I usually use the entire dataset as the training one and perform all the preprocessing steps on it, then split it into train and validation sets for hyperparameter tuning and evaluation. Later I import other data to use as a test set, and obviously I process it with the scaler and imputer fitted on the train+validation data. My only doubt is about using SMOTE to balance the dataset: in that case, can SMOTE be applied to training+validation or only to the training set? – Flavio Brienza Feb 19 '23 at 14:51
  • Yes, I believe your approach is right. I would personally use SMOTE on the training and validation sets and then use cross-validation for hyperparameter tuning to reduce possible overfitting. – justinlk Feb 19 '23 at 15:06
  • Because I have seen that using SMOTE on the entire dataset and then splitting it into train and validation significantly improves the result on the validation set, but reduces the performance on the test set. I think that is to be expected: if I apply SMOTE to the entire dataset and then split it, the validation set can contain data derived from the training set. So I do not know whether I should use it on the entire dataset and then split into train and validation, or split first even when I am not using the other dataset as the test set. I use this approach: https://youtu.be/JnlM4yLFNuo – Flavio Brienza Feb 19 '23 at 15:23
  • It is for sure not wrong to do it the way you are doing it, i.e., splitting training and validation set before SMOTE. If you find that test results improve, then stick with your approach! And again, if you feel like the validation set might introduce overfitting to your model, consider using k-fold cross-validation instead. – justinlk Feb 19 '23 at 15:28
  • Many thanks. Is the approach in the link wrong or not? I learned how to use SMOTE from it, but some comments say that it is wrong. – Flavio Brienza Feb 19 '23 at 15:35
  • Yes, the approach in the video is wrong. There, SMOTE is performed on both the training set and the test set (in the video's notebook, "X" and "y"), which you should never do. Instead, following the video's code, performing SMOTE only on "X_train" and "y_train" (defined in the video's notebook one cell below SMOTE) would have been the correct approach (see the sketch after these comments). – justinlk Feb 19 '23 at 15:42
  • If instead of calling it X_test I called it X_validation, would the approach be right? I think he meant the test set to be the validation set; he has only one dataset, which is X. – Flavio Brienza Feb 19 '23 at 15:44
  • In the last cell of the notebook, he is printing a classification report (i.e., performance metrics) using "y_test", which has been part of the data used for SMOTE. So I still believe that his approach is wrong and the classification report does not reflect an unbiased evaluation of the model. If you have another dataset available to test the model in an unbiased way, using X_test and y_test as validation set for model training would be fine. – justinlk Feb 19 '23 at 15:54
  • Indeed I think exactly like you. In my opinion he should have called the test set the validation set, which gives very high results (not normal for a test set), and then he should have imported another dataset to test the final results. For me it is not the approach that is wrong but the variables' names. You can do it like him to see what the result on the validation set is. But the best approach is to use SMOTE on the entire dataset and then Grid or Randomized Search CV to find the best parameters – Flavio Brienza Feb 19 '23 at 16:09
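To illustrate the point made in the comments above, here is a minimal sketch of applying SMOTE only after the split, assuming the imbalanced-learn package and placeholder arrays X and y:

    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # X and y are placeholders for the full feature matrix and labels.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    smote = SMOTE(random_state=42)
    X_train_res, y_train_res = smote.fit_resample(X_train, y_train)  # oversample only the training data

    # X_val / y_val stay untouched, so validation metrics are not inflated
    # by synthetic samples leaking across the split.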

It's better to split the data into training and test sets before steps like scaling and imputation. These steps are fitted using parameters learned from the training set, and the same transformation (with those training parameters) is then applied to the test set.

The test data should be representative of the data that the model is expected to encounter in real-life/production scenarios. This means that the test data should be in the same format and distribution as the data that the model will be used on in the future. This helps to ensure that the model will generalize well and make accurate predictions on new, unseen data.

If you do the split after these steps, you might accidentally leak information from the test set into the training process (data leakage), which can make the model look better than it actually is.
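One way to make this hard to get wrong is to bundle the preprocessing and the model in a scikit-learn Pipeline, so the imputer and scaler are re-fitted on each training fold during cross-validation. A minimal sketch, assuming placeholder arrays X and y and a LogisticRegression chosen only for illustration:

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])

    # cross_val_score refits the imputer and scaler on each training fold,
    # so no statistics from the held-out fold leak into preprocessing.
    scores = cross_val_score(pipe, X, y, cv=5)  # X, y are placeholders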