Treating missing data in categorical features

Question

I have a dataset in which one categorical column has a considerable number of missing values. The interesting thing about this column is that it has values only for one particular category of another column.

For example:

column 1                        column2
========================================
Google                             -
Google                             -
Google                             -
Google                             -
Facebook                        Image
Facebook                        Video
Facebook                        Image

My column of interest has values only for one category (Facebook) of the other column. Therefore, the missing values for Google cannot be imputed with an average or most frequent value, cannot be predicted, and those rows cannot be dropped either.

In such a situation, is it wise to consider the missing values '-' as a separate category in one-hot encoding? Or will this affect my machine learning model badly?

To me, it depends: you have to test both with and without the variable. Did you also try merging the `column 1` and `column 2` variables? (In your example, you could create 3 variables: `Google`, `FacebookImage` and `FacebookVideo`.) That's another thing you can try, to avoid having 2 highly correlated columns. – Adept, Aug 21 '20 at 14:19
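The merge suggested in this comment could be sketched as follows. This is only an illustration under stated assumptions: the DataFrame mirrors the example table, missing entries are stored as the literal string '-', and the `combined` column name is made up for the sketch.

```python
import pandas as pd

# Example data mirroring the question; '-' marks "no value" (assumption).
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": ["-"] * 4 + ["Image", "Video", "Image"],
})

# Append column2 only where it actually carries a value.
has_value = df["column2"] != "-"
df["combined"] = df["column1"].where(~has_value, df["column1"] + df["column2"])

# One indicator column per merged category.
combined_dummies = pd.get_dummies(df["combined"], dtype=int)
print(combined_dummies)
```

Each row then has exactly one active indicator, so the two correlated columns collapse into a single categorical feature.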

score 3 · Answer 1 · answered Aug 23 '20 at 20:39

You could break column 2 from your example into a number of indicator columns, one per value: Image, Video, ...

So the new features will be like:

Column1  Image  Video  
Google     0      0
Google     0      0
Facebook   1      0
Facebook   0      1

Shiv

Can we follow this method for all kinds of categorical columns? – Vikas Ukani Sep 19 '20 at 05:23
Suppose there is a categorical feature with too many unique values. This method would not work well for that, right? – Vikas Ukani Sep 19 '20 at 05:24

score 2 · Answer 2 · answered Aug 24 '20 at 18:35

You can try this:

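import pandas as pd

# Missing values must be NaN/None (not the string '-') to become all-zero rows.
df = pd.DataFrame({'column1': ['Google'] * 4 + ['Facebook'] * 3,
                   'column2': [None] * 4 + ['Image', 'Video', 'Image']})
df_new = pd.get_dummies(df, columns=['column2'], dtype=int)
print(df_new)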

Output:

    column1  column2_Image  column2_Video
0    Google              0              0
1    Google              0              0
2    Google              0              0
3    Google              0              0
4  Facebook              1              0
5  Facebook              0              1
6  Facebook              1              0
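On the original question of whether the missing marker can be its own category, here is a minimal sketch. It assumes missing entries are stored as None/NaN, and the label `NotApplicable` is a hypothetical name chosen for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": [None] * 4 + ["Image", "Video", "Image"],
})

# Give the missing entries an explicit label so they get their own column.
df_labeled = df.assign(column2=df["column2"].fillna("NotApplicable"))
df_with_missing = pd.get_dummies(df_labeled, columns=["column2"], dtype=int)
print(df_with_missing)
```

In this particular dataset the extra indicator is 1 exactly for the Google rows, so it duplicates information already in `column1`. Whether it helps or hurts is something to check by validating the model with and without it.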

Soumendra Mishra

What if there are many unique values in `column_2`, for instance `Image`, `Video`, `PDF`, `DOC`, `Excel`, `Audio`, etc.? – Vikas Ukani Sep 19 '20 at 05:26
It will work. For example, if you add a new value (email), a new column will be added: `column2_Email column2_Image column2_Video` – Soumendra Mishra Sep 19 '20 at 06:37
Are there any disadvantages to having too many feature columns? Suppose I use this method and end up with 200+ features in my DataFrame. Is there any downside to that? – Vikas Ukani Sep 19 '20 at 06:51
There are no performance issues. It all depends on your use case. – Soumendra Mishra Sep 19 '20 at 06:54
@VikasUkani If you want to use one-hot encoding, I would suggest using `OneHotEncoder` instead of `pd.get_dummies`. Both perform the same encoding, but `OneHotEncoder` has an advantage during deployment: if a category appears that was not seen in training, `pd.get_dummies` produces columns that no longer match the training data, which typically causes an error downstream. `OneHotEncoder` avoids this if you specify the parameter `handle_unknown = 'ignore'`. – spectre Aug 06 '21 at 09:46
You can also use the `drop = 'first'` parameter of `OneHotEncoder` to drop the first category of each encoded variable, thus reducing the dimensionality (to an extent). – spectre Aug 06 '21 at 09:48
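A small sketch of the `OneHotEncoder` behaviour described in these comments, assuming scikit-learn is installed. The `train` and `new_data` frames are made up for the illustration, with `Email` standing in for a category that was not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and later deployment data.
train = pd.DataFrame({"column2": ["Image", "Video", "Image"]})
new_data = pd.DataFrame({"column2": ["Video", "Email"]})  # 'Email' is unseen

# Unknown categories are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform returns a sparse matrix by default; convert for display.
encoded = encoder.transform(new_data).toarray()
print(encoder.categories_)
print(encoded)
```

The encoder keeps the column layout learned at fit time, so the output always has one column per training category regardless of what arrives later.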
