
I have a small dataset of 977 rows with a class proportion of 77:23.

So that the metrics focus on the minority class, I have encoded the minority class ('default') as class 1 (and 'not default' as class 0).

My input variables are categorical and of high cardinality. Below is what I tried.

Approach 1

a) Split train and test

b) Apply SMOTE-NC to the train data (see the sketch after this list)

c) Then rare_encode and ordinal_encode the train and test data separately (because of the high-cardinality input variables)

d) Build a random forest model with grid search and stratified cross-validation, optimizing recall

e) Assess the performance

f) Result: no improvement compared with the model trained on the imbalanced classes without SMOTE-NC
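For reference, a minimal sketch of step (b) with imbalanced-learn's SMOTENC (the column names and values below are made-up toy data, not the actual dataset). One caveat worth flagging: SMOTE-NC is designed for a mix of continuous and categorical features, so if every input variable is categorical, as described above, it is not applicable and SMOTEN is the imbalanced-learn variant intended for purely nominal data.

```python
import pandas as pd
from imblearn.over_sampling import SMOTENC

# Hypothetical toy data: two categorical columns plus one continuous column.
X_train = pd.DataFrame({
    "industry": ["retail", "tech", "retail", "farm", "tech", "farm"] * 20,
    "region":   ["north", "south", "east", "west", "north", "south"] * 20,
    "income":   [30.0, 55.0, 42.0, 28.0, 61.0, 35.0] * 20,
})
y_train = pd.Series([0, 0, 0, 0, 1, 1] * 20)  # imbalanced toy target

# SMOTE-NC must be told which columns are categorical (here by index):
smote_nc = SMOTENC(categorical_features=[0, 1], random_state=42)
X_res, y_res = smote_nc.fit_resample(X_train, y_train)
print(y_res.value_counts())  # classes are now balanced

# If *all* inputs were categorical, SMOTENC would raise an error;
# SMOTEN is the variant designed for all-nominal data:
# from imblearn.over_sampling import SMOTEN
# X_res, y_res = SMOTEN(random_state=42).fit_resample(X_train, y_train)
```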

Approach 2

a) Split train and test

b) rare_encode and ordinal_encode the train and test data separately (because of the high-cardinality input variables)

c) Apply SMOTE-NC to the encoded train data only (see the pipeline sketch after this list)

d) Build a random forest model with grid search and stratified cross-validation, optimizing recall

e) Assess the performance

f) Result: no improvement compared with the model trained on the imbalanced classes without SMOTE-NC
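Below is a sketch of how Approach 2 can be wired up, assuming rare_encode / ordinal_encode correspond to something like feature_engine's RareLabelEncoder and OrdinalEncoder; the data, column names, and parameter grid are hypothetical. The sampler is placed inside an imblearn Pipeline so that the cross-validation in step (d) oversamples only the training part of each fold.

```python
import pandas as pd
from feature_engine.encoding import OrdinalEncoder, RareLabelEncoder
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Hypothetical toy data, as in the previous sketch.
X = pd.DataFrame({
    "industry": ["retail", "tech", "retail", "farm", "tech", "farm"] * 40,
    "region":   ["north", "south", "east", "west", "north", "south"] * 40,
    "income":   [30.0, 55.0, 42.0, 28.0, 61.0, 35.0] * 40,
})
y = pd.Series([0, 0, 0, 0, 1, 1] * 40)

# a) split train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# b) fit the encoders on the train split only, then transform both splits
rare = RareLabelEncoder(tol=0.05, n_categories=2)      # group rare labels into 'Rare'
ordinal = OrdinalEncoder(encoding_method="arbitrary")  # integer-code the categoricals
X_train_enc = ordinal.fit_transform(rare.fit_transform(X_train))
X_test_enc = ordinal.transform(rare.transform(X_test))

# c) + d) SMOTE-NC and the random forest in one pipeline, so the oversampling
# is re-done inside each training fold of the CV, never on the validation fold
pipe = Pipeline([
    ("smote", SMOTENC(categorical_features=[0, 1], random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"rf__n_estimators": [200, 500], "rf__max_depth": [None, 10]},
    scoring="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X_train_enc, y_train)

# e) assess on the held-out test set
print(grid.best_params_)
print("test recall:", grid.score(X_test_enc, y_test))
```

Keeping SMOTE-NC inside the pipeline matters here: if the train set is oversampled once before GridSearchCV, synthetic copies of the same minority rows can land in both the training and validation folds, which inflates the cross-validated recall relative to what the test set will show.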

So, my questions are as follows:

a) Why is there no improvement in the model's performance, especially on the test data?

b) Am I doing anything incorrectly in the way I apply SMOTE and encode the categorical variables?

The Great
  • As I posted for you a few hours ago, class imbalance isn’t much of a problem, so it will be important to say why you find it problematic in order for us to help you. – Dave Feb 20 '22 at 19:41
  • I completely agree with @Dave – Carlos Mougan Feb 20 '22 at 21:32
  • See this thread https://twitter.com/CarlosMougan/status/1475756319999205377 – Carlos Mougan Feb 20 '22 at 21:32
  • This question in DSE https://datascience.stackexchange.com/questions/106461/why-smote-is-not-used-in-prize-winning-kaggle-solutions/108363#108363 – Carlos Mougan Feb 20 '22 at 21:35
  • https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he – Carlos Mougan Feb 20 '22 at 21:35
  • Okay, understood. I would like to increase the performance of my model in identifying the minority class. I have already exhausted feature engineering to the best of my knowledge. I also tried using the class weight parameter, but since the performance on the minority class didn't go up, I was trying SMOTE as a last option. Do you have any tips on what more I can try to improve the model performance? – The Great Feb 21 '22 at 00:31
  • And we don't wish to change the default 0.5 threshold. Are there any other tricks that we can try? – The Great Feb 21 '22 at 00:32
  • Another point: on the train set, my recall for the minority class is good, above 80%, but on the test set it is only 40%, a huge drop. – The Great Feb 21 '22 at 00:53
  • Why can’t you change the default threshold (or do away with a threshold at all)? – Dave Feb 21 '22 at 02:31
  • What do you mean by doing away with the threshold altogether? Sorry, my English is limited. Modifying the threshold seems like just a convenient way to increase the recall of the minority class; there is no logic/statistics behind it, it is just me changing the threshold from 0.5 to 0.6, 0.7, 0.8, or 0.9 etc. to get what I want. Since my model works well on the train data with a threshold of 0.5, I was expecting the same on the test data (with an acceptable drop in performance). – The Great Feb 21 '22 at 02:46
  • Here are the usual links I post on this topic. Briefly, by doing away with the threshold altogether, you evaluate the probability predictions. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Feb 21 '22 at 02:54
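Following up on the threshold discussion in the comments above, here is a minimal sketch of what evaluating the probability predictions directly can look like, reusing `grid`, `X_test_enc`, and `y_test` from the Approach 2 sketch; the cutoff value at the end is a hypothetical illustration, not a recommendation.

```python
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

# Probability of class 1 ('default') for each test row.
proba_test = grid.predict_proba(X_test_enc)[:, 1]

print("Brier score:", brier_score_loss(y_test, proba_test))  # lower is better
print("log loss:   ", log_loss(y_test, proba_test))          # lower is better
print("ROC AUC:    ", roc_auc_score(y_test, proba_test))     # threshold-free ranking

# If a hard decision is required later, derive the cutoff from the relative
# costs of missing a default vs. a false alarm rather than defaulting to 0.5:
cutoff = 0.30  # hypothetical, cost-driven choice
y_hat = (proba_test >= cutoff).astype(int)
```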
