Say I have a multiclass problem with a dataset like this:
user_id price target
-------+--------+-----
1 30 apple
1 20 samsung
2 32 samsung
2 40 huawei
.
.
where I have so many users that one-hot encoding (OHE) user_id is not feasible. Target encoders such as CatBoost's have achieved great results for encoding categorical features. The issue is that I can only find CatBoost/target encoders for binary classification or regression (which makes sense, in a way).
Right now I have worked around the issue by one-hot encoding the target (since it usually has far fewer categories than user_id) and then target-encoding user_id against each OHE column, i.e. with these two steps:
- OHE the target
user_id price ohe_apple ohe_samsung ohe_huawei
-------+-------+-----------+-------------+------------
1 30 1 0 0
1 20 0 1 0
2 32 0 1 0
2 40 0 0 1
- target-encode user_id for each ohe_ column (the numbers are just made up for the illustration):
price user_id_apple user_id_samsung user_id_huawei
-----+--------------+----------------+---------------
30 0.5 0.6 0
20 0.5 0.6 0
32 0 1 0
40 0 0 1
.
.
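A minimal pandas sketch of the two steps above, assuming a plain per-user mean as the encoder (no smoothing or out-of-fold computation, so it leaks the target and is for illustration only). The toy frame mirrors the example rows; the values it produces differ from the made-up numbers in the table.

```python
import pandas as pd

# Toy data mirroring the example rows above.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "price": [30, 20, 32, 40],
    "target": ["apple", "samsung", "samsung", "huawei"],
})

# Step 1: one-hot encode the target (columns come out in alphabetical order).
ohe = pd.get_dummies(df["target"], prefix="ohe", dtype=float)

# Step 2: for each ohe_ column, replace user_id by the per-user mean of that
# column, i.e. the observed share of each class for that user.
# Assumption: naive mean encoding; a real encoder would add smoothing/priors.
encoded = ohe.groupby(df["user_id"]).transform("mean")
encoded.columns = [c.replace("ohe_", "user_id_") for c in encoded.columns]

# user_id itself is dropped; only the per-class encodings are kept.
result = pd.concat([df[["price"]], encoded], axis=1)
print(result)
```

In practice the per-user means would be computed on the training split only and then mapped onto validation/test rows, with unseen users falling back to the global class frequencies.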
I have achieved a notable increase in performance with this approach, e.g. in a neural network (where the OHE does not fit on the GPU), but I wonder whether there is a target encoder for high-cardinality features (like user_id) in multiclass classification, e.g. in CatBoost?