
I am currently working on the Boston problem hosted on Kaggle. The dataset is nothing like the Titanic dataset: there are many categorical columns, and I'm trying to one-hot encode them. I've decided to start with the MSZoning column to get the approach working, and then work out a strategy for applying it to the other categorical columns. This is a small snippet of the dataset:

[screenshot: first few rows of the training dataframe]

Here are the distinct values present in MSZoning; since they have no inherent order, integer encoding alone would obviously be a bad idea:

['RL' 'RM' 'C (all)' 'FV' 'RH']
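
For context, this is roughly what a plain label encoding would do with these strings; a minimal sketch:

from sklearn.preprocessing import LabelEncoder

# Label encoding alone assigns arbitrary integers (alphabetical order of the classes),
# implying 'C (all)' < 'FV' < 'RH' < 'RL' < 'RM' -- an ordering that doesn't exist here.
labels = LabelEncoder().fit_transform(['RL', 'RM', 'C (all)', 'FV', 'RH'])
print(labels)  # [3 4 0 1 2]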

Here is my attempt in Python to assign the new one-hot-encoded data back to MSZoning. I do know that one-hot encoding turns each value into a column of its own and assigns a binary value to each, so I realize that writing the result back into a single column isn't exactly a good idea. I wanted to try it anyway:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

labelEncoder = LabelEncoder()

train['MSZoning'] = labelEncoder.fit_transform(train['MSZoning'])
train_OHE = OneHotEncoder(categorical_features=train['MSZoning'])
train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()


print(train['MSZoning'])

This gives me the following (obvious) error:

C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:392: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Boston.py", line 11, in <module>
    train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 511, in fit_transform
    self._handle_deprecations(X)
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 394, in _handle_deprecations
    n_features = X.shape[1]
IndexError: tuple index out of range

I did read through some Medium posts on this, but they didn't exactly relate to what I'm trying to do with my dataset, since they were working with toy data containing only a couple of categorical columns. What I want to know is: how do I make use of one-hot encoding after the (attempted) step above?

Andros Adrianopolos
    Quick note: you have loaded the same dataframe for both `train` and `test` – Leevo Jun 10 '19 at 09:41
  • I've been displeased with how OneHotEncoder works based on how LabelEncoder works. Like the accepted answer, `pd.get_dummies` does one-hot encoding without the (unnecessary) hassle of setting up the class. – MattR Jun 10 '19 at 20:25

2 Answers


First of all, I noticed you have loaded the same dataframe for both train and test. Change the code like this:

import numpy as np
import pandas as pd

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

At this point, one-hot encode each variable you want with pandas' get_dummies() function:

# One-hot encode a given variable
OHE_MSZoning = pd.get_dummies(train['MSZoning'])

It will be returned as a pandas dataframe. In my Jupyter Notebook it looks like this:

OHE_MSZoning.head()

[screenshot: OHE_MSZoning.head() showing the columns 'C (all)', 'FV', 'RH', 'RL', 'RM' with 0/1 values]

You can repeat the same command for all the variables you want to one-hot encode.
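
If you want to encode several categorical columns at once, `get_dummies` can also take the whole dataframe plus a `columns=` list; a minimal sketch (the extra column names, `Street` and `LotShape`, are just other categorical columns from this dataset used as examples):

import pandas as pd

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

# One-hot encode several categorical columns in one call; the original string
# columns are replaced by their dummy columns in the returned dataframe.
train_encoded = pd.get_dummies(train, columns=['MSZoning', 'Street', 'LotShape'])

print(train_encoded.filter(like='MSZoning').head())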

Hope this helps, otherwise let me know.

Leevo

Here is an approach using the encoders from sklearn:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

labelEncoder = LabelEncoder()
MSZoning_label = labelEncoder.fit_transform(train['MSZoning'])

The mapping between classes and integer labels produced by sklearn's LabelEncoder can be seen in its classes_ property:

labelEncoder.classes_
# Output: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object)

onehotEncoder = OneHotEncoder(n_values=len(labelEncoder.classes_))
MSZoning_onehot_sparse = onehotEncoder.fit_transform([MSZoning_label])

# Convert MSZoning_onehot_sparse from a sparse matrix to a dense array,
# reshape it to (n_examples, n_classes), and convert from float to int.
MSZoning_onehot = MSZoning_onehot_sparse.toarray().reshape(len(MSZoning_label), -1).astype(int)

Pack it back into a dataframe if you want:

MSZoning_label_onehot = pd.DataFrame(MSZoning_onehot, columns=labelEncoder.classes_)
MSZoning_label_onehot.head(10)

[screenshot: MSZoning_label_onehot.head(10) showing one-hot columns 'C (all)', 'FV', 'RH', 'RL', 'RM']
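
Note that `n_values` (like the `categorical_features` keyword in the question) is deprecated in scikit-learn 0.20 and removed in 0.22. A minimal sketch of the same idea with the newer OneHotEncoder API, which accepts string categories directly and needs no LabelEncoder step:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

# Pass a 2-D selection (double brackets); the encoder handles strings directly.
encoder = OneHotEncoder()
MSZoning_onehot = encoder.fit_transform(train[['MSZoning']]).toarray().astype(int)

# categories_ plays the role of LabelEncoder.classes_ here.
MSZoning_df = pd.DataFrame(MSZoning_onehot, columns=encoder.categories_[0])
print(MSZoning_df.head())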

dustindorroh
  • I don't get this line `array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object)`. MSZoning is already type object. – Andros Adrianopolos Jun 11 '19 at 09:39
  • That is the output of the line above it. `In [1]: print(labelEncoder.classes_)` `Out[2]: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object)` – dustindorroh Jun 11 '19 at 09:52
  • When you pack it back into a dataframe, that dataframe isn't `train`. Shouldn't you put your OHE variables back into the original data? – Andros Adrianopolos Jun 11 '19 at 10:20
  • I created a new dataframe in the example, but you can add it back to the `train` dataframe if you like. The indexes between the two are mapped. – dustindorroh Jun 22 '19 at 07:44
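
Following up on that last comment, a minimal sketch of merging the one-hot columns back into `train` (continuing from the answer's snippet, so `train`, `labelEncoder`, and `MSZoning_label_onehot` are assumed to already exist):

import pandas as pd

# Drop the original string column and append the one-hot columns.
# Both frames share the default RangeIndex, so concat along axis=1 aligns rows.
train = pd.concat([train.drop(columns=['MSZoning']), MSZoning_label_onehot], axis=1)

print(train[list(labelEncoder.classes_)].head())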