
Let's say I have a categorical feature (cat):

import random
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame.from_dict({"y": y, "cat": cat})

and I want to apply target encoding, regularised with cross-validation, as below:

X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42)
df_train = pd.concat([X_train, y_train], axis=1).sort_index()
df_train["kfold"] = -1
idx = df_train.index
df_train = df_train.sample(frac=1)

skf = StratifiedKFold(n_splits=5)
for fold_id, (train_id, val_id) in enumerate(skf.split(X=df_train.drop("y", axis=1), y=df_train["y"])):
    df_train.iloc[val_id, df_train.columns.get_loc("kfold")] = fold_id

df_train = df_train.loc[idx]

encoded_dfs = []

for fold in df_train["kfold"].unique():
    df_train_cv = df_train[df_train["kfold"] != fold].copy()
    df_val_cv = df_train[df_train["kfold"] == fold].copy()

    means = df_train_cv.groupby('cat')['y'].mean()
    df_val_cv['cat'] = df_val_cv['cat'].map(means)
    encoded_dfs.append(df_val_cv)

encoded_dfs = pd.concat(encoded_dfs, axis=0).sort_index()
encoded_dfs.drop('kfold', axis=1, inplace=True)

However, I have some doubts about how I should then encode the test set. Since no single mapping is deduced from the train set, I think we should fit the encoding on the whole train set and then apply it to the test set:

means = df_train.groupby('cat')['y'].mean()
X_test['cat'] = X_test['cat'].map(means)

This seems the natural way to do it since, in fact, it is exactly what the CV step mimics. But the results of the model I got were off, which made me wonder whether I am missing something. Please note that, for the sake of simplicity, I omitted the additional smoothing I also applied. Therefore, my question is: is this the correct way to encode the test set?

Carlos Mougan
Xaume

1 Answer


I have some doubts about how I should then encode the test set. Since no single mapping is deduced from the train set, I think we should fit the encoding on the whole train set and then apply it to the test set

Yes, that seems fine, although the way you do it there is a bit more complicated than using a pipeline. The idea of splitting into train and test is to mimic how the model will behave in production on unseen data. Fitting the target encoding on the test set is data leakage and gives a misleading picture of how the model will behave in production. So you compute the target means on train and then apply them to test.

If you do this and the test set contains a category that was not seen in train, the mapping returns NaN for it, and most models will then throw an error. If you have a look at the target encoder in the category_encoders library, you can deal with this:

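handle_missing: str options are 'error', 'return_nan' and 'value', defaults to 'value', which returns the target mean.

Categories that are unseen at transform time are governed by the analogous handle_unknown parameter, which takes the same options. You can handle them in different ways, and the best one depends on your problem. The default is to return the target mean.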
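The same fallback can be reproduced by hand with pandas. This is only a minimal sketch on made-up toy data (the values and the unseen category "Z" are illustrative assumptions, not taken from the question): the mapping is fitted on train only, and anything the mapping does not know is filled with the overall training mean.

```python
import pandas as pd

# Toy data, purely illustrative; "Z" never appears in train.
train = pd.DataFrame({"cat": ["A", "A", "B", "B", "C"], "y": [1, 0, 0, 0, 1]})
test = pd.DataFrame({"cat": ["A", "B", "Z"]})

# Fit the mapping on the training set only.
means = train.groupby("cat")["y"].mean()
global_mean = train["y"].mean()

# Unseen categories map to NaN; fall back to the overall training mean.
test_encoded = test["cat"].map(means).fillna(global_mean)
print(test_encoded.tolist())  # [0.5, 0.0, 0.4]
```

Replacing `fillna(global_mean)` with a fixed value, or leaving the NaN in place, corresponds to the other strategies mentioned above.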

The best practice is to create a pipeline in which the target encoding is a step (a transformer). This lets you run CV, evaluate your model on the test set, and use many other functionalities. (Here is a tutorial on how to do it.)

A code snippet:

import random
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from category_encoders.target_encoder import TargetEncoder
from sklearn.linear_model import LogisticRegression

random.seed(1234)

y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame.from_dict({"y": y, "cat": cat})

X_train, X_test, y_train, y_test = train_test_split(
    df[["cat"]], df["y"], train_size=0.8, random_state=42
)
skf = StratifiedKFold(n_splits=5)

te = TargetEncoder()
clf = LogisticRegression()

pipe = Pipeline(
    [
        ("te", te),
        ("clf", clf),
    ]
)

# Grid to search for the hyperparameters
pipe_grid = {
    "te__smoothing": [0.0001],
}

# Instantiate the grid
pipe_cv = GridSearchCV(
    pipe,
    param_grid=pipe_grid,
    n_jobs=-1,
    cv=skf,
)

pipe_cv.fit(X_train, y_train)

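# Overwrite every test value with a category never seen during fitting.
# assign() returns a new frame, which avoids a SettingWithCopyWarning.
X_test = X_test.assign(cat="UUUUU")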

pipe_cv.predict(X_test)

Note that the code is not optimal, but it should show you how to do target encoding on train and test with a pipeline, and how it copes with unseen data :)

Note that the categories have been assigned randomly, so the model learns that the best it can do is predict the most frequent class. If you swap in ElasticNet (a regressor), you will get the mean.

If you remove the assignment of the unseen category to the test set, you will still get the same results.

Carlos Mougan
  • But the OP is asking about the test set, and he encoded it using the train set. So it seems fine, doesn't it? – 10xAI Sep 09 '20 at 06:37
  • It's kind of fine, but what happens if you then have unseen categories in test? – Carlos Mougan Sep 09 '20 at 06:56
  • Yes, I agree with the idea of a Pipeline. Just wanted to make sure you didn't misread the question. Thanks!!! – 10xAI Sep 09 '20 at 07:25
  • Thanks @CarlosMougan. In the meantime I bumped into another way of encoding test set - just take the mean of mapped values during cv within each category. So assuming we deal with 5-fold CV on the train data, for each category we take the mean of 5 different values. As a side question: are there any pros and cons of that method compared to the one described earlier? – Xaume Sep 13 '20 at 08:42
  • @Xaume using a pipeline is more solid. As you see, you need less code and it's better structured. Using pipelines is one of the best practices when building ML models. – Carlos Mougan Sep 13 '20 at 10:58
  • 1
    @Xaume, a question about your method, if you have 5 CV folds, you will have 5 means per category, which will you use? – Carlos Mougan Sep 13 '20 at 10:59
  • 1
    A mean of those 5 means. And yeah, pipelines look very clear and foolproof indeed, thanks for the suggestion. But I'm still wondering what would be the better way to encode the test set. – Xaume Sep 14 '20 at 06:48
  • @Xaume Supervised learning is based on the idea that your train data is as close as possible to the test data. The best way is to make sure that nothing is left unseen, and if you cannot do that, I normally replace it with the general mean or, depending on the problem, a fixed value/NaN – Carlos Mougan Sep 14 '20 at 08:19
  • I don't understand why you would do target encoding within the cross-validation method. With 5 folds in this case, isn't the model just being trained on 5 differently encoded folds? I don't understand how the CV loop is helping in this case @CarlosMougan – Maths12 Dec 07 '20 at 17:49
  • 1
    @Maths12 you do the target encoding within the CV and in the pipeline to search for hyperparameters and to avoid data leakage in the CV folds – Carlos Mougan Dec 09 '20 at 08:37
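
The two ways of encoding the test set raised in the comments (one mapping fitted on the whole train set versus the average of the per-fold mappings) can be put side by side with a small sketch. The toy data, the shuffled 5-fold split and the variable names below are illustrative assumptions, not the asker's actual setup.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy training data, purely illustrative: 30 rows, 3 categories.
train = pd.DataFrame({
    "cat": ["A", "B", "C"] * 10,
    "y": [1, 0, 0, 0, 1, 0] * 5,
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Option 1: a single mapping fitted on the whole training set.
full_means = train.groupby("cat")["y"].mean()

# Option 2: average the mappings fitted on the training part of each fold.
fold_means = [
    train.iloc[fit_idx].groupby("cat")["y"].mean()
    for fit_idx, _ in skf.split(train[["cat"]], train["y"])
]
avg_of_fold_means = pd.concat(fold_means, axis=1).mean(axis=1)

comparison = pd.DataFrame(
    {"full_train": full_means, "avg_of_folds": avg_of_fold_means}
)
print(comparison)
```

Both options use training labels only, so neither leaks test information. Each row contributes to four of the five fold mappings, so the two columns coincide when every fold holds the same number of rows per category and otherwise differ only by how unevenly the categories are spread across folds.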