
I would like to reduce multiclass classification targets to binary classification targets. Ideally, this mapping would happen within scikit-learn so the same transformation applies during both training and prediction.

I looked at the "Transforming the prediction target (y)" documentation but did not see anything that would work. Ideally, it would be a classifier version of `TransformedTargetRegressor`.

Something like this mapping:

```python
targets_multi  = {'A', 'B', 'C', 'D'}
targets_binary = {0: {'A', 'B'},
                  1: {'C', 'D'}}
```
Brian Spiering
  • Do those two dictionaries exactly represent what you want to do? Why not just create a new dictionary with two new keys and so on? – WBM Feb 26 '21 at 11:27
  • I would like the functionality in native scikit-learn code so I get the benefits of the ecosystem, such as use in pipelines. – Brian Spiering Feb 26 '21 at 15:09
  • Still a bit unclear. You can create your own transformers to work in a pipeline, in your case convert `targets_multi` from a set into a list and then create your new dictionary `targets_binary` – WBM Feb 26 '21 at 15:29

1 Answer


Of the three stated purposes of pipelines, you'd get the "convenience and encapsulation" one, but not the others:

  • Joint parameter selection: you don't have any parameters for this transformation.
  • Safety (from data leakage): your transformation is a fixed mapping that learns nothing from the data, so there is no data leakage in applying it to the entire dataset up front.

This feels like something that is the definition of the targets, and is best considered a part of the data retrieval.
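As a minimal sketch of doing this up front, the grouping from the question can be inverted into a flat lookup and applied to the labels before any scikit-learn code runs. The names `label_to_binary` and `binarize_targets` are made up for illustration, and the sketch assumes every label appears in exactly one group.

```python
# Grouping taken from the question's example.
targets_binary = {0: {'A', 'B'},
                  1: {'C', 'D'}}

# Invert the grouping into a flat lookup: original label -> binary label.
label_to_binary = {
    label: binary
    for binary, labels in targets_binary.items()
    for label in labels
}


def binarize_targets(y):
    """Map an iterable of multiclass labels to binary labels.

    Raises KeyError for a label that is missing from the mapping,
    which is usually preferable to silently mislabeling a row.
    """
    return [label_to_binary[label] for label in y]


y_multi = ['A', 'C', 'B', 'D', 'A']
y_binary = binarize_targets(y_multi)
print(y_binary)  # [0, 1, 0, 1, 0]
```

The same `binarize_targets` call can be reused on held-out labels when scoring, so training and evaluation see the same target definition.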


scikit-learn expects `transform` methods to take only X as input, not y. You can get partway around that by overriding `fit_transform` from `TransformerMixin`. However, nothing downstream expects to receive two return values (transformed X and y), so this won't work.

You can make a little more headway with the imbalanced-learn package, which provides its own `Pipeline` with more flexible transformation syntax. The purpose there is to implement resamplers, and that raises a major issue: resamplers are not applied at prediction time.
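If the mapping really does need to travel with the model, one alternative to a pipeline step is a thin wrapper around the classifier, similar in spirit to what `TransformedTargetRegressor` does for regressors. The following is only a sketch under assumptions: `BinarizedTargetClassifier` is a hypothetical class, not part of scikit-learn, and `MajorityClassifier` is a stand-in so the example runs without any dependencies. Any object with `fit` and `predict` could take its place.

```python
from collections import Counter


class MajorityClassifier:
    """Stand-in for a real classifier; always predicts the most common label."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class BinarizedTargetClassifier:
    """Hypothetical wrapper that maps y through a fixed lookup before fitting.

    Predictions come back already in the binary label space, so nothing
    needs to be transformed at prediction time.
    """

    def __init__(self, classifier, label_to_binary):
        self.classifier = classifier
        self.label_to_binary = label_to_binary

    def fit(self, X, y):
        # Translate the multiclass labels, then delegate to the wrapped model.
        y_binary = [self.label_to_binary[label] for label in y]
        self.classifier.fit(X, y_binary)
        return self

    def predict(self, X):
        return self.classifier.predict(X)


label_to_binary = {'A': 0, 'B': 0, 'C': 1, 'D': 1}
X = [[0], [1], [2], [3], [4]]
y = ['A', 'B', 'C', 'A', 'D']

model = BinarizedTargetClassifier(MajorityClassifier(), label_to_binary)
model.fit(X, y)
print(model.predict([[5], [6]]))  # [0, 0]
```

To work with `clone`, `GridSearchCV`, and the rest of the ecosystem, a wrapper like this would also need to inherit from `BaseEstimator` and `ClassifierMixin`, which the sketch leaves out.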

Ethan
Ben Reiniger