How to obtain original feature names after using one-hot encoding

Question

This question is on an implementation aspect of scikit-learn's DecisionTreeClassifier().

How do I get the feature names ranked in descending order, from the feature_importances_ returned by the scikit-learn DecisionTreeClassifier()?

The problem is that the input features to the classifier are not the original ones - they are numerically encoded ones from pandas DataFrame get_dummies.

For example, I take the mushroom dataset from the UCI repository. Features in the dataset include - cap_shape, cap_surface, cap_color, odor, etc.

pandas get_dummies encodes these into multiple features based on the values of the original features. Say cap_shape has the values b, c, f, k, ...; after encoding, the new columns are cap_shape_b, cap_shape_c, cap_shape_f, and so on. Similar transformations happen for the other features.

After training, the classifier tells me that the top features are cap_shape_b, cap_shape_c, cap_shape_f, odor_a, odor_c, odor_f, odor_l. From this result, I would like my function to return the two original features, that is, cap_shape and odor.

score 1 · Answer 1 · answered Jun 29 '18 at 14:38

Consider using the one-hot encoder in category_encoders module for your encoding. It has an inverse_transform method which I believe will transform your one-hot encoded data back to its original form.

answered Jun 29 '18 at 14:38

bradS

score 0 · Answer 2 · edited Aug 29 '21 at 03:59

As shown in the "Classification" section of these docs, you can export your tree using graphviz (the docs note that you have to install the graphviz package, too). This way you can visualize the tree built by the algorithm. As for the input features being transformed from the original ones, that is something the algorithm can't help you with, but you should be able to handle it yourself, since you made the transformations yourself.

If anything is still unclear, leave a comment.

edited Aug 29 '21 at 03:59

Shayan Shafiq

answered Apr 29 '18 at 17:35

Felipe Bormann

Thank you for your reply. I have provided an example in the question. Hope this helps clarify what I am looking for. – S Datta Apr 30 '18 at 13:21
I saw your edit. If you build a mapping of the dummy variables you've created, you can write a function that returns the original values. But again, the classifier can only predict from the transformed features you fed it, not from the original values. – Felipe Bormann Apr 30 '18 at 13:45
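
The mapping idea from the comment above could look roughly like the sketch below. It is only an illustration: the toy DataFrame, the `dummy_to_original` dict and the `original_features` helper are made-up names, not part of pandas or scikit-learn. Each original column is encoded on its own, so every dummy column can be recorded against the feature it came from without parsing any names.

```python
import pandas as pd

# Toy stand-in for two mushroom columns (values are illustrative only).
df = pd.DataFrame({
    "cap_shape": ["b", "c", "f", "b"],
    "odor": ["a", "l", "a", "c"],
})

# Encode one original column at a time and remember where each
# dummy column came from.
dummy_to_original = {}  # hypothetical helper dict
encoded_parts = []
for col in df.columns:
    dummies = pd.get_dummies(df[col], prefix=col)
    encoded_parts.append(dummies)
    for dummy_col in dummies.columns:
        dummy_to_original[dummy_col] = col

# X is what would be passed to the classifier.
X = pd.concat(encoded_parts, axis=1)


def original_features(dummy_cols):
    """Map dummy column names back to original names, keeping order, dropping repeats."""
    seen = []
    for name in dummy_cols:
        original = dummy_to_original[name]
        if original not in seen:
            seen.append(original)
    return seen


print(original_features(["cap_shape_b", "odor_a", "cap_shape_f"]))
# ['cap_shape', 'odor']
```

Passing the dummy columns in the order given by the classifier's importances would return the original features in that same order.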

score 0 · Answer 3 · edited Oct 27 '18 at 17:57

If you just need the names of the original features, you can use a regex to parse them out. You can easily choose a naming convention for the transformed features (using the prefix parameter in get_dummies). After getting the scores, traverse the list of features in ascending or descending order, parse the column names with the regex, and store the results in an ordered dict.
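
A minimal sketch of the regex approach described above, under some assumptions: the dummies are assumed to have been created with a separator that cannot occur in the original names (for example `pd.get_dummies(df, prefix_sep="__")`), the importance values are made up for illustration, and summing the scores per original feature is just one possible way to store the results.

```python
import re
from collections import OrderedDict

# Assumption: "__" separates the original feature name from the category value.
SEP = "__"
pattern = re.compile(r"^(?P<feature>.+?)" + re.escape(SEP) + r"(?P<value>.+)$")

# Made-up importances keyed by encoded column name. With a fitted tree these
# would come from pairing the encoded column names with clf.feature_importances_.
importances = {
    "cap_shape__b": 0.08,
    "odor__a": 0.25,
    "cap_shape__c": 0.15,
    "odor__f": 0.40,
    "cap_color__n": 0.12,
}

# Walk the encoded columns from most to least important and collect the
# original names in order of first appearance, summing their scores.
ranked = OrderedDict()
for column, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    match = pattern.match(column)
    feature = match.group("feature") if match else column
    ranked[feature] = ranked.get(feature, 0.0) + score

print(list(ranked))
# ['odor', 'cap_shape', 'cap_color']
```

With the default single-underscore separator, a name such as cap_shape_b is ambiguous to a regex, which is why a distinct separator is assumed here.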

If you need the whole dataset transformed back, then go with the inverse_transform method mentioned in other answers.
