Reduce multiclass classification targets to binary classification targets in scikit-learn

Question

I would like to reduce multiclass classification targets to binary classification targets. Ideally, this mapping would happen within scikit-learn so the same transformation applies during both training and prediction.

I looked at the "Transforming the prediction target (y)" documentation but did not see anything that would work. Ideally, it would be a classifier version of TransformedTargetRegressor.

Something like this mapping:

targets_multi  = {'A', 'B', 'C', 'D'}
targets_binary = {0: {'A', 'B'},
                  1: {'C', 'D'}}

Do those two dictionaries exactly represent what you want to do? Why not just create a new dictionary with two new keys and so on? — WBM, Feb 26 '21 at 11:27
I would like the functionality in native scikit-learn code so I get the benefits of the ecosystem, such as use in pipelines. — Brian Spiering, Feb 26 '21 at 15:09
Still a bit unclear. You can create your own transformers to work in a pipeline, in your case convert `targets_multi` from a set into a list and then create your new dictionary `targets_binary` — WBM, Feb 26 '21 at 15:29

score 1 · Answer 1 · edited Aug 13 '21 at 23:10

Of the three stated purposes of pipelines, you'd get the "convenience and encapsulation" one, but not the others:

Joint parameter selection: you don't have any parameters for this transformation.
Safety (from data leak): your transformation is context-free (each label is mapped on its own, and nothing is learned from the data), so there is no data leakage in applying it to the entire dataset up front.

This feels like part of the definition of the targets, and is best treated as part of the data retrieval.
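As a rough illustration of applying the mapping up front, here is a minimal sketch. It reuses the `targets_binary` dictionary from the question; the feature values, the `label_to_binary` helper, and the choice of `LogisticRegression` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Mapping from the question: binary class -> set of original labels.
targets_binary = {0: {'A', 'B'},
                  1: {'C', 'D'}}

# Invert it into a flat lookup: original label -> binary class.
# (label_to_binary is a hypothetical helper name, not part of scikit-learn.)
label_to_binary = {label: binary
                   for binary, labels in targets_binary.items()
                   for label in labels}

# Made-up toy data, for illustration only.
X = np.array([[0.0], [1.0], [2.0], [3.0], [0.5], [2.5]])
y_multi = np.array(['A', 'B', 'C', 'D', 'A', 'C'])

# Apply the mapping once, as part of preparing the data.
y_binary = np.array([label_to_binary[label] for label in y_multi])

# Any classifier can then be trained directly on the binary targets.
clf = LogisticRegression().fit(X, y_binary)
```

Because the classifier only ever sees the binary labels, its predictions are already binary, so nothing needs to be mapped at prediction time.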

scikit-learn expects transform methods to take only X as input, not y. You can get part of the way around that by overriding fit_transform from TransformerMixin, which does receive y. However, nothing downstream expects to get two return values (transformed X and y), so this won't work.
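The limitation can be seen in a small sketch. `SpyTransformer` is a hypothetical pass-through transformer written only for this illustration, and the data is made up. It receives y during fitting, but its transform returns only X, so the final classifier still sees the original labels.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class SpyTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through transformer that records whether it saw y."""

    def fit(self, X, y=None):
        # The pipeline passes y here while fitting.
        self.saw_y_in_fit_ = y is not None
        return self

    def transform(self, X):
        # No y in the signature, and only X is handed to the next step.
        return X


# Made-up toy data, for illustration only.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(['A', 'B', 'C', 'D'])

pipe = Pipeline([('spy', SpyTransformer()),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)

saw_y = pipe.named_steps['spy'].saw_y_in_fit_
final_classes = list(pipe.named_steps['clf'].classes_)
```

Here `saw_y` is true, yet `final_classes` still holds the four original labels: the transformer step had no channel through which to hand a modified y to the classifier.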

You can make a little more headway with the imbalanced-learn package, which provides its own Pipeline with more flexible transformation syntax. The purpose there is to implement resamplers, and that raises a major issue: resamplers are not applied at prediction time.
