Random Forest prediction fails due to unseen Features

Question

I have trained a Random Forest model on a dataset and would like to predict outcomes on other data that were not seen in training. When doing this, I get

ValueError: Number of features of the model must match the input. Model n_features is 12 and input n_features is 13

The problem is that there are some variables from the training data that do not exist in my prediction set. E.g. I capture the count of some feature via dummy variables D_0, D_1, D_2, D_3 indicating the number of occurrences of D. I might have no D_2 in my training data but D_2 in my prediction data set.

What's best practice in such a case? I am planning to use this estimator repeatedly on future data, and I can't know in advance which features will be present. Should I rather check for inconsistencies between both feature lists and manually correct those which do not overlap? In the above example, I'd recode all occurrences of D_2 as D_3 in order to align the feature lists.

Hi, can you elaborate a bit what kind of features you are talking about and why your training set does not contain these (dummy) features? What is the context and purpose of your model? — Jonathan, Jan 15 '20 at 12:27
I think you have it the wrong way around - the model expects 12 features but you are feeding it data with 13 features... which implies that your training data is smaller than your prediction data. — bradS, Jan 15 '20 at 12:29
https://datascience.stackexchange.com/q/54052/55122 , https://datascience.stackexchange.com/q/47140/55122 , https://datascience.stackexchange.com/q/56331/55122 — Ben Reiniger, Jan 15 '20 at 15:04

Blenz · Accepted Answer · 2020-01-15T12:40:29.557

The problem is the way you're one-hot encoding.

Best practice for any type of encoding:

You should fit the one-hot encoder on the training data only, and when encoding test data, you should reuse that same fitted encoder.

E.g., sklearn.preprocessing.OneHotEncoder does this, and it has a parameter called handle_unknown:

handle_unknown{‘error’, ‘ignore’}, default=’error’ Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

The best option is to set this parameter to 'ignore', so that unknown feature values are ignored instead of raising an error, until you eventually retrain your model on data that includes the new feature values.

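from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the training data only; unseen categories will encode as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')
train = ohe.fit_transform(train)
# Reuse the fitted encoder so the test data gets exactly the same columns
test = ohe.transform(test)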
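To see how this plays out with the random forest itself, here is a minimal sketch that wraps the encoder and the model in one pipeline. The toy data, the column names D and x, and the variable names are all made up for illustration; only the scikit-learn classes are real. The value 2 never appears in training, yet prediction still works because that row's dummy columns simply come out as zeros.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "D" plays the role of the count feature in the question.
X_train = pd.DataFrame({"D": [0, 1, 3, 0, 1, 3],
                        "x": [0.1, 0.4, 0.9, 0.2, 0.5, 0.8]})
y_train = [0, 0, 1, 0, 0, 1]

# New data contains D == 2, which was never seen during training.
X_new = pd.DataFrame({"D": [1, 2], "x": [0.3, 0.7]})

# One-hot encode "D", pass "x" through unchanged.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["D"])],
    remainder="passthrough",
    sparse_threshold=0,
)

model = Pipeline([
    ("encode", encode),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_train, y_train)

# The encoder always emits the columns learned in training: D_0, D_1, D_3, x.
encoded_new = model.named_steps["encode"].transform(X_new)
predictions = model.predict(X_new)

print(encoded_new.shape)  # same number of columns as in training
print(encoded_new[1])     # dummy columns are all zero for the unseen D == 2
```

Keeping the encoder inside the pipeline means the fitted categories travel with the model, so future data goes through exactly the same transformation without any extra bookkeeping.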

Or you could, as you said, manually correct differences in the feature space, but that would be time-consuming at each update of your model, and your code could still raise an error if you're sloppy in the manual correction.
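If you do go the manual route, one possible way to align the columns is with pandas, sketched below under the assumption that the dummies were built with pd.get_dummies. The data and variable names are hypothetical. Note that this drops the unseen D_2 column rather than recoding it as D_3 as proposed in the question, so the effect is the same as handle_unknown='ignore'.

```python
import pandas as pd

# Hypothetical toy data: D == 2 appears only in the new data.
train = pd.DataFrame({"D": [0, 1, 3]})
new = pd.DataFrame({"D": [1, 2]})

train_dummies = pd.get_dummies(train, columns=["D"], prefix="D")  # D_0, D_1, D_3
new_dummies = pd.get_dummies(new, columns=["D"], prefix="D")      # D_1, D_2

# Keep exactly the training columns, in training order:
# the unseen D_2 is dropped, the missing D_0 and D_3 are added as zeros.
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(new_aligned.columns))
```

You would have to store the training column list alongside the model and repeat this step for every new batch, which is exactly the bookkeeping the fitted encoder does for you.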

Thanks so much, the `handle_unknown` parameter is what I was looking for. — E. Sommer, Jan 15 '20 at 12:42
