I'm using SMOTENC to oversample an imbalanced dataset.
I thought the point of SMOTENC was that it can oversample categorical features without one-hot encoding them. I want to avoid one-hot encoding because of the curse of dimensionality, and instead let CatBoost handle the categorical features by declaring them through its Pool class.
However, when trying to oversample with SMOTENC I still get the error:
could not convert string to float
First, I perform some preprocessing on my numerical and categorical features.
Preprocessing
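import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Numerical columns: cast to float32, impute with the mean, scale to [0, 1].
numerical_transformer = Pipeline(
    steps=[
        ("transformer", FunctionTransformer(lambda d: d.astype(np.float32))),
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler()),
    ],
    verbose=True,
)

# Categorical columns: cast to str and impute with the most frequent value.
# One-hot encoding is deliberately left out.
categorical_transformer = Pipeline(
    steps=[
        ("transformer", FunctionTransformer(lambda d: d.astype(str))),
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # ("oh_encoder", OneHotEncoder(handle_unknown="ignore")),
    ],
    verbose=True,
)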
Second, my resampling transformer consists of an undersampler followed by an oversampler (cat_col_indices holds the indices of my categorical features, all of which have dtype "object"):
Resampling
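from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline  # imblearn's Pipeline provides fit_resample
from imblearn.under_sampling import RandomUnderSampler

resampling_coefficient = 0.6
resampling_transformer = Pipeline(
    steps=[
        (
            "undersampler",
            RandomUnderSampler(
                sampling_strategy=resampling_coefficient
            ),
        ),
        (
            "oversampler",
            SMOTENC(
                categorical_features=cat_col_indices,
                sampling_strategy="not majority",
                k_neighbors=3,
                n_jobs=16,
            ),
        ),
    ],
    verbose=True,
)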
I preprocess my data and resample:
x_t = preprocessor.fit_transform(x)
x_t, y_t = resampling_transformer.fit_resample(x_t, y)
Calling fit_resample on resampling_transformer gives me:
ValueError: could not convert string to float: '(str)'
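One check that might narrow this down is to list which columns of x_t actually contain values that cannot be parsed as floats, and compare that list with cat_col_indices. If preprocessor is a ColumnTransformer (it is not shown here, so this is an assumption), its output columns follow the order of its transformers rather than the original column order, so the indices could have shifted. The sketch below is plain Python; the helper name non_numeric_columns and the toy sample data are hypothetical and only stand in for x_t.

```python
def non_numeric_columns(rows):
    """Return the sorted indices of columns holding at least one value float() rejects."""
    bad = set()
    for row in rows:
        for idx, value in enumerate(row):
            if idx in bad:
                continue  # column already flagged, no need to test it again
            try:
                float(value)
            except (TypeError, ValueError):
                bad.add(idx)
    return sorted(bad)


# Hypothetical stand-in for x_t: two numeric columns followed by one string column.
sample = [
    [0.1, 0.5, "red"],
    [0.3, 0.2, "blue"],
]
string_cols = non_numeric_columns(sample)
print(string_cols)  # [2]
```

With the real data this would be called as non_numeric_columns(x_t) and the result compared with sorted(cat_col_indices). Note that a categorical column whose labels happen to look like numbers (for example "1" or "2.5") would not be flagged by this check.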
Do I still need to one-hot encode when using SMOTENC, or am I doing something wrong?