I am working on the Boston challenge hosted on Kaggle and I'm still refining my features. Looking at the dataset, I realize that some columns need to be binary-encoded, some need to be encoded as decimals (ranked on a scale of n), and some need to be one-hot encoded. I've collected these columns into separate lists, based on my own judgement of how each should be encoded:

categorical_columns = ['MSSubClass', 'MSZoning', 'Alley', 'LandContour', 'Neighborhood', 'Condition1', 'Condition2',
                       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating',
                       'Functional', 'GarageType', 'PavedDrive', 'SaleType', 'SaleCondition']

binary_columns = ['Street', 'CentralAir']

ranked_columns = ['LotShape', 'Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
                  'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu',
                  'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

A fellow Stack Exchange user suggested that I use the pandas.get_dummies() function to one-hot encode categorical variables like MSZoning and assign the result to a variable, like this:

OHE_MSZoning = pd.get_dummies(train['MSZoning'])

I'd like to know how I can automate this process using functions and control-flow statements such as a for loop.

Andros Adrianopolos

1 Answer

I'm the fellow Stack Exchange user, hi! I wrote a function that applies one-hot encoding to every column in your categorical_columns list:

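import pandas as pd

def serial_OHE(df, categorical_columns):

    # iterate on each categorical column
    for col in categorical_columns:

        # take one-hot encoding; prefix=col keeps the new column names unique
        # when two columns share the same category labels
        OHE_sdf = pd.get_dummies(df[col], prefix=col)

        # drop the old categorical column from original df
        # (inplace=True also modifies the dataframe that was passed in)
        df.drop(col, axis=1, inplace=True)

        # attach one-hot encoded columns to original dataframe, keeping the
        # column names so the next df[col] lookup still works
        df = pd.concat([df, OHE_sdf], axis=1)

    return df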

So you can call it like this:

df = serial_OHE(df, categorical_columns)

Let me know if there are any problems.
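
For the binary and ranked lists, one-hot encoding would discard the ordering, so a different treatment may fit better. Below is a minimal sketch of handling all three kinds of column in one function. The name `encode_features`, the toy dataframe, and the two mapping dictionaries are hypothetical, and the label orderings are assumptions that should be checked against the competition's data description.

```python
import pandas as pd

# Toy stand-in for the training data; the real frame has many more columns.
train = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "CentralAir": ["Y", "N", "Y"],
    "ExterQual": ["Gd", "TA", "Ex"],
    "LotArea": [8450, 9600, 11250],
})

# Assumed label-to-integer mappings (hypothetical; verify before use).
binary_maps = {"CentralAir": {"N": 0, "Y": 1}}
ranked_maps = {"ExterQual": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}}


def encode_features(df, categorical_columns, binary_maps, ranked_maps):
    """Return an encoded copy of df; the input frame is left untouched."""
    out = df.copy()
    # Binary and ranked columns: replace each label with its integer code.
    # Labels missing from a mapping become NaN.
    for col, mapping in {**binary_maps, **ranked_maps}.items():
        out[col] = out[col].map(mapping)
    # Categorical columns: one-hot encode in a single call; new columns
    # are named "<source column>_<label>".
    return pd.get_dummies(out, columns=categorical_columns)


encoded = encode_features(train, ["MSZoning"], binary_maps, ranked_maps)
print(encoded)
```

Passing `columns=` to `pd.get_dummies` lets pandas drop the source columns and attach the dummy columns itself, so no explicit loop, drop, or concat is needed for the categorical part.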

Leevo
  • So this is returning an entire new training dataset? – Andros Adrianopolos Jun 12 '19 at 10:47
  • It returns a new dataset in which each column listed in `categorical_columns` has been replaced by its one-hot encoded counterpart. – Leevo Jun 12 '19 at 10:55
  • Ah okay. Thank you very much. I'll accept and upvote your answer. If you think that this was a well asked question, could you give me an upvote as well? – Andros Adrianopolos Jun 12 '19 at 11:01
  • I gotta do this for binary and decimal encoding as well. How can I do all 3 in the same function? It just means that I need to reassign the dataset 3 times. – Andros Adrianopolos Jun 12 '19 at 11:29
  • It should work with the other lists as well, I think. Try calling `df = serial_OHE(df, binary_columns)` and `df = serial_OHE(df, ranked_columns)`, and let me know if it works. – Leevo Jun 12 '19 at 12:41
  • If you don't mind, could you break this line down for me: `df.drop(col, axis=1, inplace=True)`? I don't get how the parameters relate to each other. – Andros Adrianopolos Jun 13 '19 at 05:44
  • 1
    Sure. The first part: `df.drop(col)` tells `df` to drop what corresponds to `col`. The `axis=1` argument says: "I want you to drop a column (axis=1) and not a row (that would be axis=0)". Finally, `inplace=True` means: "the new dataframe that you get after the column drop, is the new `df`, i.e. substitute it to the original `df`". – Leevo Jun 13 '19 at 07:12
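
To make the roles of `axis` and `inplace` concrete, here is a small sketch on a made-up three-column dataframe (the column names `a`, `b`, `c` are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=1 drops a column by label and returns a new frame; df is unchanged.
without_b = df.drop("b", axis=1)

# axis=0 drops a row instead, here the row whose index label is 0.
without_row0 = df.drop(0, axis=0)

# inplace=True modifies df itself, and the call returns None.
result = df.drop("c", axis=1, inplace=True)

print(list(without_b.columns))
print(list(without_row0.index))
print(result, list(df.columns))
```

Because the in-place call returns `None`, writing `df = df.drop(col, axis=1, inplace=True)` would overwrite `df` with `None`; use either the assignment form or `inplace=True`, not both.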