
I am currently working on the Boston problem hosted on Kaggle. The dataset is nothing like the Titanic dataset: there are many categorical columns, and I'm trying to one-hot encode them. I've decided to start with the MSZoning column to get the approach working, and then work out a strategy for applying it to the other categorical columns. This is a small snippet of the dataset:

[screenshot: first few rows of the training dataframe]

Here are the distinct values present in MSZoning; since they have no inherent order, integer encoding alone would obviously be a bad idea:

['RL' 'RM' 'C (all)' 'FV' 'RH']
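
For context, this is roughly what a plain label encoding would do with these strings; a minimal sketch:

from sklearn.preprocessing import LabelEncoder

# Label encoding alone assigns arbitrary integers (alphabetical order of the classes),
# implying 'C (all)' < 'FV' < 'RH' < 'RL' < 'RM' -- an ordering that doesn't exist here.
labels = LabelEncoder().fit_transform(['RL', 'RM', 'C (all)', 'FV', 'RH'])
print(labels)  # [3 4 0 1 2]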

Here is my attempt in Python to assign the new one-hot-encoded data back to MSZoning. I do know that one-hot encoding turns each value into a column of its own and assigns a binary value to each, so I realize that writing the result back into a single column isn't exactly a good idea. I wanted to try it anyway:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

labelEncoder = LabelEncoder()

train['MSZoning'] = labelEncoder.fit_transform(train['MSZoning'])
train_OHE = OneHotEncoder(categorical_features=train['MSZoning'])
train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()


print(train['MSZoning'])

This gives me the following (obvious) error:

C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:392: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Boston.py", line 11, in <module>
    train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 511, in fit_transform
    self._handle_deprecations(X)
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 394, in _handle_deprecations
    n_features = X.shape[1]
IndexError: tuple index out of range

I did read through some Medium posts on this, but they didn't exactly relate to what I'm trying to do with my dataset, since they were working with toy data containing only a couple of categorical columns. What I want to know is: how do I make use of one-hot encoding after the (attempted) step above?

Andros Adrianopolos
    Quick note: you have loaded the same dataframe for both `train` and `test` – Leevo Jun 10 '19 at 09:41
  • I've been displeased with how OneHotEncoder works based on how LabelEncoder works. Like the accepted answer, `pd.get_dummies` does one-hot encoding without the (unnecessary) hassle of setting up the class. – MattR Jun 10 '19 at 20:25

2 Answers


First of all, I noticed you have loaded the same dataframe for both train and test. Change the code like this:

import numpy as np
import pandas as pd

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

At this point, one-hot encode each variable you want with pandas' get_dummies() function:

# One-hot encode a given variable
OHE_MSZoning = pd.get_dummies(train['MSZoning'])

It will be returned as a pandas dataframe. In my Jupyter Notebook it looks like this:

OHE_MSZoning.head()

[screenshot: OHE_MSZoning.head() showing the columns 'C (all)', 'FV', 'RH', 'RL', 'RM' with 0/1 values]

You can repeat the same command for all the variables you want to one-hot encode.
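
If you want to encode several categorical columns at once, `get_dummies` can also take the whole dataframe plus a `columns=` list; a minimal sketch (the extra column names, `Street` and `LotShape`, are just other categorical columns from this dataset used as examples):

import pandas as pd

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

# One-hot encode several categorical columns in one call; the original string
# columns are replaced by their dummy columns in the returned dataframe.
train_encoded = pd.get_dummies(train, columns=['MSZoning', 'Street', 'LotShape'])

print(train_encoded.filter(like='MSZoning').head())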

Hope this helps, otherwise let me know.

Leevo

Here is an approach using the encoders from sklearn:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

labelEncoder = LabelEncoder()
MSZoning_label = labelEncoder.fit_transform(train['MSZoning'])

The mapping between classes and integer labels produced by sklearn's LabelEncoder can be seen in its classes_ property:

labelEncoder.classes_
# Output: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object)

onehotEncoder = OneHotEncoder(n_values=len(labelEncoder.classes_))
MSZoning_onehot_sparse = onehotEncoder.fit_transform([MSZoning_label])

# Convert MSZoning_onehot_sparse from a sparse matrix to a dense array,
# reshape it to (n_examples, n_classes), and convert from float to int.
MSZoning_onehot = MSZoning_onehot_sparse.toarray().reshape(len(MSZoning_label), -1).astype(int)

Pack it back into a dataframe if you want:

MSZoning_label_onehot = pd.DataFrame(MSZoning_onehot, columns=labelEncoder.classes_)
MSZoning_label_onehot.head(10)

[screenshot: MSZoning_label_onehot.head(10) showing one-hot columns 'C (all)', 'FV', 'RH', 'RL', 'RM']
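
Note that `n_values` (like the `categorical_features` keyword in the question) is deprecated in scikit-learn 0.20 and removed in 0.22. A minimal sketch of the same idea with the newer OneHotEncoder API, which accepts string categories directly and needs no LabelEncoder step:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

# Pass a 2-D selection (double brackets); the encoder handles strings directly.
encoder = OneHotEncoder()
MSZoning_onehot = encoder.fit_transform(train[['MSZoning']]).toarray().astype(int)

# categories_ plays the role of LabelEncoder.classes_ here.
MSZoning_df = pd.DataFrame(MSZoning_onehot, columns=encoder.categories_[0])
print(MSZoning_df.head())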

dustindorroh
  • I don't get this line `array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object)`. MSZoning is already type object. – Andros Adrianopolos Jun 11 '19 at 09:39
  • That is the output of the line above it. `In [1]: print(labelEncoder.classes_)` `Out[2]: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object)` – dustindorroh Jun 11 '19 at 09:52
  • When you pack it back into a dataframe, that dataframe isn't `train`. Shouldn't you put your OHE variables back into the original data? – Andros Adrianopolos Jun 11 '19 at 10:20
  • I created a new dataframe in the example, but you can add it back to the `train` dataframe if you like. The indexes between the two are mapped. – dustindorroh Jun 22 '19 at 07:44
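
Following up on that last comment, a minimal sketch of merging the one-hot columns back into `train` (continuing from the answer's snippet, so `train`, `labelEncoder`, and `MSZoning_label_onehot` are assumed to already exist):

import pandas as pd

# Drop the original string column and append the one-hot columns.
# Both frames share the default RangeIndex, so concat along axis=1 aligns rows.
train = pd.concat([train.drop(columns=['MSZoning']), MSZoning_label_onehot], axis=1)

print(train[list(labelEncoder.classes_)].head())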