SMOTENC oversampling without one-hot encoding

Question

I'm using SMOTENC to oversample an imbalanced dataset.

I thought the point of SMOTENC was to allow oversampling of categorical features without one-hot encoding them. I don't want to one-hot encode because I want to avoid the curse of dimensionality and let CatBoost handle the categorical features, which I declare through the Pool class.

However, when trying to oversample with SMOTENC I still get the error:

could not convert string to float

First, I perform some preprocessing on my numerical and categorical features.

Preprocessing

    numerical_transformer = Pipeline(
        steps=[
            ("transformer", FunctionTransformer(lambda d: d.astype(np.float32))),
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", MinMaxScaler()),
        ],
        verbose=True,
    )

    categorical_transformer = Pipeline(
        steps=[
            ("transformer", FunctionTransformer(lambda d: d.astype(str))),
            ("imputer", SimpleImputer(strategy="most_frequent")),
            # ("oh_encoder", OneHotEncoder(handle_unknown="ignore")),
        ],
        verbose=True,
    )

Second, my resampling transformer consists of an undersampler followed by an oversampler (cat_col_indices holds the indices of my categorical features, all of which have dtype "object"):

Resampling

    resampling_coefficient = 0.6
    resampling_transformer = Pipeline(
        steps=[
            (
                "undersampler",
                RandomUnderSampler(sampling_strategy=resampling_coefficient),
            ),
            (
                "oversampler",
                SMOTENC(
                    categorical_features=cat_col_indices,
                    sampling_strategy="not majority",
                    k_neighbors=3,
                    n_jobs=16,
                ),
            ),
        ],
        verbose=True,
    )

I preprocess my data and resample:

    x_t = preprocessor.fit_transform(x)
    x_t, y_t = resampling_transformer.fit_resample(x_t, y)

The call to resampling_transformer.fit_resample raises:

ValueError: could not convert string to float: '(str)'

Do I still need to one-hot encode when using SMOTENC, or am I doing something wrong?

Aidan Wade · Answer 1 · 2023-04-27T19:05:48.260

I ran into the same problem and fixed it by converting the categorical columns to the 'object' dtype instead of the 'category' dtype.

    X[categorical_features] = X[categorical_features].astype("object")

Edit: Another possible cause is that SMOTENC wants the indices, not the names, of the categorical attributes. Other transformers earlier in the pipeline (numeric scalers, for example) may reorder the dataframe columns, so the indices can be wrong by the time the data reaches SMOTENC.
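
To illustrate the reordering point, here is a minimal sketch in plain Python. It assumes the preprocessor (not shown in the question) is a scikit-learn ColumnTransformer that applies the numerical pipeline first and the categorical pipeline second; such a transformer concatenates its outputs in the order the transformers are listed, regardless of the original column order. It also assumes each pipeline emits exactly one output column per input column (no one-hot encoding, no dropped columns). The helper output_column_positions and the column names are hypothetical and only serve the example.

```python
def output_column_positions(column_groups):
    """Map each column name to its position in the concatenated output.

    column_groups is an ordered list of (group_name, column_names) pairs,
    listed in the same order as the transformers in the preprocessor.
    Assumes one output column per input column.
    """
    positions = {}
    offset = 0
    for _group_name, columns in column_groups:
        for column in columns:
            positions[column] = offset
            offset += 1
    return positions


# Hypothetical column names, for illustration only.
numerical_features = ["age", "income"]
categorical_features = ["city", "device"]

# The original frame might be ordered city, age, device, income.
# After preprocessing, the numerical block comes first, then the categorical block.
column_groups = [
    ("numerical", numerical_features),
    ("categorical", categorical_features),
]

positions = output_column_positions(column_groups)

# These are the indices to pass as categorical_features to SMOTENC.
cat_col_indices = [positions[name] for name in categorical_features]
print(cat_col_indices)
```

In the original frame order the categorical columns would sit at positions 0 and 2, but after preprocessing they sit at 2 and 3, so indices computed from the raw dataframe would point SMOTENC at the wrong columns.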
