Obtaining consistent one-hot encoding of train / production data

Question

I'm building an app that will require user input. Currently, on the training set, I run the following code, in which data is a pandas dataframe with a combination of categorical and numerical columns.

import pandas as pd

dummified_data = pd.get_dummies(data)
train_data = dummified_data[:10000]
test_data = dummified_data[10000:12000]

Currently, I have a hand-written function that takes user-inputted data and transforms it into a format like dummy data. This doesn't seem sustainable as the number of columns/the size of my categorical variables grows.

Is there a way to dummify training data and production data consistently?

See also https://stackoverflow.com/q/54786266/10495893 – Ben Reiniger, Aug 24 '19 at 01:46

score 7 · Answer 1 · answered Jun 19 '19 at 09:05

If I understand your question correctly, you want to make sure that the order of the encoding is always the same. Have you tried sklearn, or more specifically sklearn.preprocessing.OneHotEncoder?

The way it works is that you fit (or fit_transform) on your training sample. Then you save the state of your encoder (for example you can pickle it). In production you then load this encoder and transform. This should then provide the same results.
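As a rough sketch of that fit, save, load, transform cycle (the color values and variable names here are made up for illustration, and pickle.dumps stands in for writing to a file):

```python
import pickle

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column with three distinct categories.
train_colors = np.array(["red", "white", "blue", "red"]).reshape(-1, 1)

# Training: fit the encoder once and serialize its fitted state.
encoder = OneHotEncoder()
encoder.fit(train_colors)
saved_state = pickle.dumps(encoder)  # in practice, write this to a file

# Production: restore the encoder and only call transform (never fit again).
restored = pickle.loads(saved_state)
prod_colors = np.array(["white"]).reshape(-1, 1)
encoded = restored.transform(prod_colors).toarray()

# The column count comes from the training categories, not from the
# production rows, so a one-row input still yields three columns.
print(encoded.shape)
```

If the production output has fewer columns than the training output, a likely cause is that fit or fit_transform was called again on the production data instead of transform.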

If for some reason you want to stay away from scikit-learn, you can also build a dict of the categories seen in your training sample and then apply this dict in production. This should be possible with pandas' built-in functionality.
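One way the pandas-only route might look is sketched below. The column names, the dummify helper, and the choice of pd.Categorical with a fixed category list are assumptions made for this example, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical training frame with one categorical and one numeric column.
train = pd.DataFrame({"color": ["red", "white", "blue"], "size": [1, 2, 3]})

# Training: record the categories seen per categorical column.
# This dict is plain data and could be saved as JSON.
categories = {"color": sorted(train["color"].unique())}


def dummify(frame, categories):
    """One-hot encode `frame` using a fixed category list per column."""
    frame = frame.copy()
    for column, values in categories.items():
        # A fixed Categorical dtype makes get_dummies emit one column per
        # known category, in the same order, even if a value is absent.
        frame[column] = pd.Categorical(frame[column], categories=values)
    return pd.get_dummies(frame, columns=list(categories))


train_dummies = dummify(train, categories)

# Production: a single user-supplied row gets the same columns.
production = pd.DataFrame({"color": ["white"], "size": [5]})
prod_dummies = dummify(production, categories)
print(list(prod_dummies.columns))
```

With this approach, a value that was not seen in training becomes a missing category and is encoded as all zeros rather than raising an error.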

I followed this same procedure to make predictions on a smaller subset, but for some reason the saved encoder produces far fewer columns than it should. I understand that a smaller subset would have fewer categories, but shouldn't the number of columns at least be the same for the saved encoder? I save the encoder during training after fit_transform and then apply it in another script to the smaller subset. – dan, Oct 25 '21 at 11:10

Blenz · Accepted Answer · 2019-06-21T08:55:24.303

Use sklearn.preprocessing.OneHotEncoder and transfer the one-hot encoding to your web service (I'm guessing that's how you're using the model for inference) via sklearn.pipeline.Pipeline. The pipeline stores the state fitted on your training data and applies the same transformation to your production data.

Example:

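from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

pipeline1 = Pipeline([
                ('OneHotEncoder', OneHotEncoder())
            ])
# OneHotEncoder expects 2-D input, hence the reshape to a single column.
pipeline1.fit(trainingdata.column1.values.reshape(-1,1))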

This is how you create a pipeline containing the OneHotEncoder and fit it on your training data. All that is left is to dump the pipeline to a file, load it later in your production environment, and call the transform method on the loaded pipeline:

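import joblib

# Training environment: save the fitted pipeline.
joblib.dump(pipeline1, "pipeline1.joblib")
# Production environment: load it and transform only (no refit).
pipeline1 = joblib.load('pipeline1.joblib')
momo = pipeline1.transform(productiondata.column1.values.reshape(-1,1)).toarray()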

Here, the variable momo contains your production data with the pipeline (containing the one-hot encoding operation) applied to it.

What if we have to encode more than one column? – moarra, Feb 16 '23 at 16:51
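One possible way to handle several columns is sklearn.compose.ColumnTransformer, sketched here under assumptions: the frame contents, the column names column2 and amount, and the handle_unknown="ignore" setting are invented for illustration and are not part of the answer above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: two categorical columns and one numeric column.
trainingdata = pd.DataFrame(
    {
        "column1": ["a", "b", "a"],
        "column2": ["x", "y", "z"],
        "amount": [1.0, 2.0, 3.0],
    }
)

# Encode the listed categorical columns; pass the other columns through.
transformer = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["column1", "column2"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
transformer.fit(trainingdata)

# A production row is transformed with the categories learned in training.
productiondata = pd.DataFrame({"column1": ["b"], "column2": ["x"], "amount": [7.0]})
encoded = transformer.transform(productiondata)
print(encoded.shape)
```

The fitted transformer can be dumped and loaded with joblib in the same way as the pipeline.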

score 3 · Answer 3 · answered Oct 30 '19 at 23:59

We don't use Pipeline, so we just needed to save the OneHotEncoder state for use in a different process. pickle is frowned upon due to security and backward-compatibility issues, and by extension so is joblib. To be more precise, we were hoping to find a better/different way to save the encoder state from training and use it during scoring later. We ended up doing the following:

1. Fit/transform during training as usual.
2. Harvest the ordered categories from the training-time encoder and save them as JSON/YAML.
3. During the production run, read the ordered categories and use them to hydrate a new encoder. It needs a bit of a kick to get going (see the example below).
4. The kicker is that this allows us to alter the behavior of the encoder in production, e.g. make it resilient to hitherto unknown categories.

The following example code demonstrates the approach:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

ds = pd.DataFrame(
    {
        "colors": ["red", "red", "white", "blue"],
        "fruits": ["apple", "orange", "apple", "orange"],
    }
)

This is our sample dataset:

   colors  fruits
0     red   apple
1     red  orange
2   white   apple
3    blue  orange

Let's encode the data:

# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions.
encoder_training = OneHotEncoder(sparse=False)
encoded_colors = encoder_training.fit_transform(np.array(ds.colors).reshape(-1, 1))
encoded_colors

The encoding is as follows:

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

As expected this one isn't tolerant to unknown categories.

encoder_training.transform(np.array("black").reshape(-1, 1))

This will give the following error.

ValueError: Found unknown categories ['black'] in column 0 during transform

After the fit, however, the encoder contains an ordered list of the categories that it used to generate the encoding (and hence that are being used by the model):

color_categories = encoder_training.categories_

Here are its contents

color_categories

[array(['blue', 'red', 'white'], dtype=object)]

Now let's create another encoder for production/scoring. Note, however, that this one needs to handle unknown values differently. Why? Because the needs of production (and sometimes even the people responsible for it) are often different from those of training. For example, if the encoder is fit on the entire dataset, including the test/holdout set, then during training it never has to deal with unknown categories.

encoder_production = OneHotEncoder(
    handle_unknown="ignore", sparse=False, categories=color_categories
)

We have supplied it with the ordered categories array that we harvested from the training encoder. In practice this would be serialized as JSON/YAML during training and loaded back in the production Python process. Note: I'm using sparse=False only to make it easy to see the encoding in the example.

Oddly, though, we can't use this directly: an encoder must be fitted before it can transform, so first we need to bootstrap it with a call to fit. The data passed to fit can be anything. To avoid any type mismatch issues, we simply use the first item from the categories list:

encoder_production.fit(np.array(color_categories[0][0]).reshape(-1, 1))

Now our encoder is ready for prime time!

encoder_production.transform(np.array("red").reshape(-1, 1))

This prints:

array([[0., 1., 0.]])

Unlike the training instance, this one is tolerant to unknown values:

encoder_production.transform(np.array("black").reshape(-1, 1))

This prints:

array([[0., 0., 0.]])
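The JSON round trip mentioned above could look roughly like the sketch below. It keeps the payload in a string instead of a file, and the conversion through tolist() is an assumption about how one might make the NumPy arrays serializable.

```python
import json

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training side: fit and harvest the ordered categories.
train_colors = np.array(["red", "red", "white", "blue"]).reshape(-1, 1)
encoder_training = OneHotEncoder().fit(train_colors)

# NumPy arrays are not JSON serializable, so convert them to plain lists.
payload = json.dumps([cats.tolist() for cats in encoder_training.categories_])

# Production side: read the category lists and hydrate a new encoder.
categories = json.loads(payload)
encoder_production = OneHotEncoder(handle_unknown="ignore", categories=categories)
# Fit on one known value per column only to initialize the encoder.
encoder_production.fit(np.array([[cats[0] for cats in categories]], dtype=object))

# "red" is a known category; "black" was never seen in training.
encoded = encoder_production.transform(np.array([["red"], ["black"]], dtype=object))
print(encoded.toarray())
```

Because only plain lists of strings cross the process boundary, nothing has to be unpickled in production.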
