Imputation in train or test data

Question

I have a rather simple question.

Let's say I want to do a median imputation. I've read in some places that you should do:

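import pandas as pd
from sklearn.impute import SimpleImputer
# fit_transform learns the medians from train; transform reuses them on test
imputer = SimpleImputer(strategy='median')
train_imputed = pd.DataFrame(imputer.fit_transform(train[feature_columns]), columns=train[feature_columns].columns)
test_imputed = pd.DataFrame(imputer.transform(test[feature_columns]), columns=test[feature_columns].columns)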

But doesn't that mean I'm using the train median and applying it to the test dataset?

Wouldn't it be better to just calculate the test data's median and apply it to the NaN values? What's wrong with that approach? In my opinion it should be the best solution, as long as we use the same method and we are not doing it on target columns.

So the better code would also call fit_transform on the test set instead of just transform.

By using the training set's median on both datasets, you're ensuring consistency. Your model learns patterns from your training data. If you impute a different median in your test set, you're introducing information that the model hasn't seen during training. – Rodrigo Ferro, Aug 17 '23 at 20:49

score 0 · Accepted Answer · answered Aug 17 '23 at 20:37

I think that approach could lead to data leakage.

If you calculate the median of the entire dataset (that is, including the test set), you are using information from the test set during training, which is a form of data leakage. The same applies if you calculate separate medians for the train and test datasets: you are still using information about the distribution of your test set during model creation.

So, while calculating separate medians might give better performance on your specific test set (because the median more accurately represents that particular sample of data), the model would likely perform worse on new, unseen data because those preprocessing steps were tailored too closely to your original dataset.

It is the model's job to generalize well on new data. However, the model can only learn from the information it is given during training. If we use any information from our test set (or future unseen data) during training, we are giving our model an unfair advantage that won't exist when it's deployed in a real-world scenario.

This is why we always ensure that all preprocessing steps, including imputation of missing values, are learned solely from the training set. This way, we simulate as closely as possible the conditions under which our model will be used in practice.
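
A minimal sketch of that idea, assuming scikit-learn's make_pipeline and a made-up toy dataset. The column names x and y, the values, and the choice of LinearRegression are illustrative assumptions and are not taken from the question:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data, hypothetical values chosen only for illustration.
train = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 100.0],
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
})
test = pd.DataFrame({"x": [np.nan, 10.0, 20.0]})

# The imputer is a pipeline step, so fit() learns the median
# from the training features only.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(train[["x"]], train["y"])

# Median stored by the fitted imputer (computed over the non-missing train values).
imputer_step = model.named_steps["simpleimputer"]
learned_median = imputer_step.statistics_[0]

# transform() reuses the stored train median; it never looks at the test median.
test_imputed = imputer_step.transform(test[["x"]])

# predict() runs the same fitted imputation before the regressor.
predictions = model.predict(test[["x"]])
```

Here the missing test value is filled with the median of the training column, not with the median of the test column. Keeping the imputer inside the pipeline also means cross-validation refits it on each training fold, so no held-out fold influences the imputation.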

Harshad Patil

Maybe my question is even simpler, then. Let's assume we keep doing median imputation because it performed perfectly. Then, when we get unseen data, we just apply the imputation again (calculating the unseen data's median), since the column that has NaN values is a feature, not a target. Shouldn't the imputation be added to the workflow/pipeline? I think I'm missing something trivial here; can you enlighten me? – Guilherme Raibolt Aug 17 '23 at 20:45
Yes! You are right. If the median imputation works and you want to go ahead with it, then you have to put it in the pipeline as well. When you get new unseen data (say, in a production environment), you would not calculate a new median from this data. Instead, you would use the same median value that was calculated from your training set during model development. – Harshad Patil Aug 17 '23 at 20:47
Also, in many real-world scenarios when predictions are needed for one observation at a time, calculating a meaningful median wouldn't even be possible because there's only one instance. – Harshad Patil Aug 17 '23 at 20:50
Your second answer convinced me. But still, let's create a scenario where I do daily predictions with all-day data (so N>1). Why shouldn't I calculate a new median? If I have shown that median imputation is a decent approach, why should I use a previous median instead of calculating a new one? I can't see how that's more precise. Data is not leaking with this approach. From my point of view it's the opposite: using a past-measured median would worsen my predictions. – Guilherme Raibolt Aug 17 '23 at 20:57
Your point is valid in a scenario where you have a large enough batch of new data each day and the underlying distribution of your data changes over time. In this case, recalculating the median daily could indeed be more accurate than using an old median from the training set. But there are some points that you need to consider. For example, if you recalculate the median every day, your model's behavior can change daily. This might be desirable if your data changes rapidly, but it also makes performance harder to track and debug. – Harshad Patil Aug 17 '23 at 21:04
In general, machine learning models work best when they're trained on a representative sample of all possible inputs they might encounter. If your daily batches of new data are different enough from your original training set that using an old median becomes inaccurate, it may indicate that your training set was not representative and/or that retraining more frequently would be beneficial. – Harshad Patil Aug 17 '23 at 21:08
I'm officially convinced; it would probably be an issue only in time series problems. Thanks a lot. – Guilherme Raibolt Aug 17 '23 at 21:15
Happy to help you! – Harshad Patil Aug 17 '23 at 21:16
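
The two cases discussed in the comments above (a single observation at prediction time, and a daily batch whose distribution has shifted) can be sketched as follows. This is only an illustration under assumed toy values; the column name x and the idea of logging the gap between the batch median and the stored median as a drift signal are assumptions, not something prescribed in the thread:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training column; the imputer stores its median when fitted.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
imputer = SimpleImputer(strategy="median").fit(train)
train_median = imputer.statistics_[0]

# Case 1: one observation arrives with a missing value.
single = pd.DataFrame({"x": [np.nan]})
own_median = single["x"].median()           # no usable median exists for this row
filled_single = imputer.transform(single)   # the stored train median still works

# Case 2: a daily batch (N > 1) whose values have drifted away from training.
daily = pd.DataFrame({"x": [np.nan, 50.0, 60.0, 70.0]})
batch_median = daily["x"].median()
drift = batch_median - train_median         # a large gap hints that retraining may be due
filled_daily = imputer.transform(daily)     # still filled with the train median
```

One way to read this: rather than silently swapping in each day's median, the gap between the batch median and the stored one can be monitored, and a large gap treated as a reason to retrain the whole pipeline (imputer and model together) on more recent data.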
