We don't use Pipeline, so we needed to save just the OneHotEncoder state for use in a different process. pickle is frowned upon due to security and backward-compatibility issues, and by extension joblib is frowned upon, too. To be more precise, we were hoping to find a better/different way to save the encoder state from training and use it during scoring later. We ended up doing the following:
- Fit/transform during training as usual
- Harvest the ordered categories from the training-time encoder and save them as json/yaml.
- During the production run, read the ordered categories and use them to hydrate a new encoder. It needs a bit of a kick to get going (see example below).
- The kicker is that it allows us to alter the behavior of the encoder during production, e.g. make it resilient to hitherto unknown categories.
The following example code demonstrates the approach:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
ds = pd.DataFrame(
    {
        "colors": ["red", "red", "white", "blue"],
        "fruits": ["apple", "orange", "apple", "orange"],
    }
)
This is our sample dataset:
colors fruits
0 red apple
1 red orange
2 white apple
3 blue orange
Let's encode the data:
# sparse=False returns a dense array; scikit-learn 1.2+ renames this parameter to sparse_output.
encoder_training = OneHotEncoder(sparse=False)
encoded_colors = encoder_training.fit_transform(np.array(ds.colors).reshape(-1, 1))
encoded_colors
The encoding is as follows:
array([[0., 1., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.]])
As expected, this encoder isn't tolerant of unknown categories:
encoder_training.transform(np.array("black").reshape(-1, 1))
This raises the following error:
ValueError: Found unknown categories ['black'] in column 0 during transform
After the fit, however, the encoder contains the ordered list of categories that it used to generate the encoding (and hence that the model relies on):
color_categories = encoder_training.categories_
Here are its contents:
color_categories
[array(['blue', 'red', 'white'], dtype=object)]
Now let's create another encoder for production/scoring. Note, however, that this one needs different behavior for handling unknown values. Why? Because the needs of production (and sometimes even the people responsible for it) often differ from those of training. For example, if the encoder is fitted on the entire dataset, including the test/holdout set, then it never has to deal with unknown categories during training.
encoder_production = OneHotEncoder(
    handle_unknown="ignore", sparse=False, categories=color_categories
)
We have supplied it with the ordered categories array that we harvested from the training encoder. In practice this would be serialized as json/yaml during training and loaded back in the production Python process. Note: I'm using sparse=False only to make it easy to see the encoding in the example.
Oddly, though, we can't use this encoder directly: it must be fitted before it can transform, even though the categories are already supplied. The data passed to fit can be anything. To avoid any type mismatch issues, we simply use the first item from the categories list:
encoder_production.fit(np.array(color_categories[0][0]).reshape(-1, 1))
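As a quick sanity check, here is a minimal sketch (not part of the workflow above) suggesting that the bootstrap fit does not narrow the encoder down to the single value it saw: the supplied categories are kept as they are. The variable name `fitted` is only for illustration, and the default sparse output is used here to sidestep the `sparse` / `sparse_output` naming difference between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same ordered categories as harvested from the training encoder.
color_categories = [np.array(["blue", "red", "white"], dtype=object)]

encoder = OneHotEncoder(handle_unknown="ignore", categories=color_categories)
# Bootstrap fit on a single row holding the first category.
encoder.fit(np.array([[color_categories[0][0]]], dtype=object))

# The fitted categories should still be the full supplied list.
fitted = [cats.tolist() for cats in encoder.categories_]
print(fitted)
```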
Now our encoder is ready for prime time!
encoder_production.transform(np.array("red").reshape(-1, 1))
This prints:
array([[0., 1., 0.]])
Unlike the training instance, this one is tolerant of unknown values:
encoder_production.transform(np.array("black").reshape(-1, 1))
This prints:
array([[0., 0., 0.]])
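The json step is only described in words above, so here is a rough sketch of what the round trip might look like. It assumes string categories and keeps the JSON in an in-memory string (`payload`) instead of a file; the names `payload`, `loaded` and `bootstrap_row` are made up for this illustration. The encoders use the default sparse output and `.toarray()` so the snippet does not depend on the `sparse` / `sparse_output` parameter name.

```python
import json

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training side: fit as usual, then harvest the ordered categories.
train_colors = np.array(["red", "red", "white", "blue"]).reshape(-1, 1)
encoder_training = OneHotEncoder()
encoder_training.fit(train_colors)

# categories_ is a list of numpy arrays; convert to plain lists for JSON.
payload = json.dumps([cats.tolist() for cats in encoder_training.categories_])

# Production side: rebuild the categories and hydrate a tolerant encoder.
loaded = [np.array(cats, dtype=object) for cats in json.loads(payload)]
encoder_production = OneHotEncoder(handle_unknown="ignore", categories=loaded)

# Bootstrap fit on one row built from the first category of each column.
bootstrap_row = np.array([[cats[0] for cats in loaded]], dtype=object)
encoder_production.fit(bootstrap_row)

known = encoder_production.transform(np.array([["red"]], dtype=object)).toarray()
unknown = encoder_production.transform(np.array([["black"]], dtype=object)).toarray()
print(known)
print(unknown)
```

In a real setup the `payload` string would be written to a file at training time and read back in the scoring process. Non-string categories (numbers, NaN) may need extra care when converting to and from JSON.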