It's a mistake to use LabelEncoder for a categorical feature; it should be used only for a categorical target variable. This is because it converts values to integers, thereby introducing an arbitrary order over the values.
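A minimal sketch of the problem (the color values are just a made-up example):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]
encoded = LabelEncoder().fit_transform(colors)
print(encoded)  # [2 1 0 1]
# The integers imply blue < green < red, an ordering that doesn't
# exist in the data, and the model will treat it as meaningful.
```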
The values which don't appear in the training set are not what causes the poor performance (you can check this). It's very likely that your model overfits: since the features have many distinct values, you would need a massive number of instances for the model to see a representative sample of every value. Real data is never like that, and it's clear from your description that some values occur too rarely (that's why some appear only in the test set).
The solution is to simplify the data so that the model doesn't rely on patterns which appear by chance in the training set:
- Replace values which appear rarely with a special value, e.g. RARE_VALUE. Try different thresholds for the minimum frequency.
- Encode the categorical features with one-hot encoding (OHE).
- Since the rare values were removed, the number of OHE features will be lower. To avoid overfitting, the ratio of instances to features should be high enough.
- In case there are still values in the test set which don't occur in the training set, replace them with the special value RARE_VALUE as well (see the sketch after this list).
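Here is a sketch of the whole procedure. It assumes pandas DataFrames named train and test with a categorical column "city", and a minimum frequency of 10; these names and the threshold are placeholders, not part of your data. It also assumes scikit-learn 1.2+ (older versions use sparse=False instead of sparse_output=False).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

MIN_FREQ = 10  # tune this: a higher threshold means fewer OHE features

# 1. Find the values that are frequent enough in the training set.
counts = train["city"].value_counts()
frequent = set(counts[counts >= MIN_FREQ].index)

# 2. Replace rare training values with the special value.
train["city"] = train["city"].where(train["city"].isin(frequent), "RARE_VALUE")

# 3. Fit the one-hot encoding on the simplified training column.
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train = ohe.fit_transform(train[["city"]])

# 4. Test values that are rare *or* never seen in training also
#    become RARE_VALUE, so the encoder can handle them.
test["city"] = test["city"].where(test["city"].isin(frequent), "RARE_VALUE")
X_test = ohe.transform(test[["city"]])
```

Note that handle_unknown="ignore" is only a safety net here: after step 4, anything unseen has already been mapped to RARE_VALUE, so the encoder should never actually encounter an unknown category.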