Say I have a multiclass problem with a dataset like this:


user_id  price   target
-------+--------+-----
1         30      apple
1         20      samsung
2         32      samsung
2         40      huawei
.
.

where I have a lot of users, i.e. one-hot encoding (OHE) is not feasible. Target encoders such as the one in CatBoost have achieved great results for encoding categorical features. The issue is that I can only find CatBoost/target encoders for binary classification or regression (which makes sense, in a way).

Right now I have worked around the issue by one-hot encoding the target (since it often has fewer categories) and then target-encoding user_id against each one-hot column, i.e. with these two steps:

  1. OHE the target:
user_id  price   ohe_apple   ohe_samsung   ohe_huawei
-------+-------+-----------+-------------+------------
1         30       1             0             0
1         20       0             1             0
2         32       0             1             0
2         40       0             0             1
  2. Target-encode user_id for each ohe_ column (the numbers are made up for illustration):

price  user_id_apple user_id_samsung  user_id_huawei
-----+--------------+----------------+---------------
30         0.5            0.6             0
20         0.5            0.6             0
32         0               1              0
40         0               0              1
.
.
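
The two steps above can be sketched with pandas roughly as follows. This is only a minimal illustration on the four example rows: it uses a plain per-user mean with no smoothing or out-of-fold fitting (which a real target encoder would add to limit target leakage), and the `user_id_<class>` column names are just my own naming.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "price": [30, 20, 32, 40],
    "target": ["apple", "samsung", "samsung", "huawei"],
})

# Step 1: one-hot encode the target (one 0/1 column per class).
ohe = pd.get_dummies(df["target"], dtype=float)

# Step 2: for every class column, replace user_id by that user's mean
# of the column (plain mean encoding; no smoothing, no cross-fitting).
per_user = ohe.groupby(df["user_id"]).transform("mean")
per_user.columns = [f"user_id_{c}" for c in per_user.columns]

# Final feature matrix: price plus one encoded column per class.
encoded = pd.concat([df[["price"]], per_user], axis=1)
print(encoded)
```

Each encoded column is the share of that user's rows belonging to the class, so the width of the result grows with the number of classes rather than the number of users.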

I have seen a notable increase in performance with this approach, e.g. in a neural network (where the OHE features cannot fit into GPU memory), but I wonder whether there is a target encoder for high-cardinality features (like user_id) in multiclass classification, e.g. in CatBoost?

CutePoison

1 Answer

Feature hashing, e.g. HashingEncoder() from category_encoders, is widely applicable in such cases, with a controllable trade-off between feature size and information loss.
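
To make the idea concrete, here is a rough standard-library sketch of the hashing trick. It is not the category_encoders implementation; `hash_bucket`, `hash_encode` and the choice of MD5 are assumptions made for the illustration. Every value is hashed into one of `n_components` buckets, so the output width is fixed no matter how many users there are, at the price of collisions.

```python
import hashlib

def hash_bucket(value, n_components=8):
    """Map a categorical value to a bucket index in [0, n_components)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=8):
    """Return one fixed-width indicator row per input value."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

# The user_id column from the question, hashed into 8 columns.
encoded = hash_encode([1, 1, 2, 2], n_components=8)
for row in encoded:
    print(row)
```

A larger `n_components` means fewer collisions (less information loss) but more columns, which is the trade-off mentioned above.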

category_encoders also provides a PolynomialWrapper(), which automates the extension of binary target encoders to multiclass (it still one-hot encodes the target internally).

Edit: valid point, hashing is target-agnostic, so to speak. It is still a better option than OHE, and the category_encoders devs even recommend it when there is a huge number of classes.

dx2-66