I have a rather simple question.
Let's say I want to do median imputation. I've read in several places that you should do:
import pandas as pd
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
# fit on train (learns the train medians), then only transform test
train_imputed = pd.DataFrame(imputer.fit_transform(train[feature_columns]), columns=train[feature_columns].columns)
test_imputed = pd.DataFrame(imputer.transform(test[feature_columns]), columns=test[feature_columns].columns)
But doesn't that mean I'm taking the median of the training set and applying it to the test set?
Wouldn't it be better to compute the median of the test data itself and use that to fill its NaN values? What's wrong with that approach? In my opinion it should be the better solution, as long as we use the same method on both sets and don't apply it to the target column.
In that case, the better code would also call fit_transform on the test set instead of just transform.
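To make the comparison concrete, here is a minimal sketch of the two options I mean. The toy `train` and `test` frames and the single column `x` are made up for illustration, and I'm assuming `SimpleImputer` ignores NaN values when it computes the median:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for this example only
feature_columns = ["x"]
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan]})
test = pd.DataFrame({"x": [10.0, np.nan, 30.0]})

# Option A (what I've read): fit on train, then reuse the train median on test
imputer = SimpleImputer(strategy="median")
imputer.fit(train[feature_columns])
test_a = pd.DataFrame(
    imputer.transform(test[feature_columns]),
    columns=feature_columns,
    index=test.index,
)

# Option B (what I'm asking about): fit a separate imputer on test itself
test_imputer = SimpleImputer(strategy="median")
test_b = pd.DataFrame(
    test_imputer.fit_transform(test[feature_columns]),
    columns=feature_columns,
    index=test.index,
)

print(test_a["x"].tolist())  # NaN filled with the train median
print(test_b["x"].tolist())  # NaN filled with the test median
```

In this toy case, option A fills the missing test value with 2.0 (the train median) and option B fills it with 20.0 (the test median), so the two approaches can give quite different results.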