I'm trying to use grid search to find the best parameters for my model. Given that I have to apply the NearMiss undersampling method during cross-validation, should I fit my grid search on the undersampled dataset (whichever undersampling technique is used) or on my entire training data (the whole dataset) before using cross-validation?
-
Good news! Class imbalance is not a problem! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jul 19 '21 at 00:43
-
For grid search, where do you get the initial parameter values from? Is it some kind of random guess? – Encipher Oct 24 '22 at 04:19
-
@Encipher, for grid search you need to refer to the model's parameters (check them in the sklearn documentation). Then you can build your grid search dictionary with the different values you want to try for each parameter of your model. – Valentin Nov 07 '22 at 09:07
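As a minimal sketch of what such a dictionary can look like: the thread does not name a model, so scikit-learn's `LogisticRegression` and the candidate values below are assumptions chosen only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

# Keys must be constructor parameters of the estimator;
# values are hand-picked candidates (illustrative, not recommendations).
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# Sanity check: every key is a real parameter of the chosen model.
valid_names = LogisticRegression().get_params().keys()
assert set(param_grid) <= set(valid_names)

# ParameterGrid expands the dictionary into every combination to try.
combinations = list(ParameterGrid(param_grid))
print(len(combinations))  # 4 values of C times 2 values of class_weight
```

The values are usually not random guesses: a common starting point is the model's defaults plus a few values spread around them, often on a logarithmic scale.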
2 Answers
You can write the grid search and cross-validation loop manually instead of using GridSearchCV and, for each split, upsample the rarest class only in the folds used for training. The fold used for validation stays untouched. This is because if you upsample the whole training set before cross-validation, the validation error will be too optimistic: the training and validation portions of each split would contain the same duplicated instances (and possibly plenty of them).
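A minimal sketch of that manual loop could look like the following. It assumes a binary problem, a synthetic dataset from `make_classification`, `LogisticRegression` as the model, random duplication as the upsampling method, and F1 as the validation metric; none of these come from the question. The helper names `upsample_minority` and `grid_search_with_fold_resampling` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, StratifiedKFold


def upsample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    n_extra = counts.max() - len(minority_idx)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


def grid_search_with_fold_resampling(X, y, param_grid, n_splits=5, seed=0):
    """Grid search where resampling touches only the training part of each split."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = []
    for params in ParameterGrid(param_grid):
        fold_scores = []
        for train_idx, val_idx in cv.split(X, y):
            # Resample the training folds only.
            X_tr, y_tr = upsample_minority(X[train_idx], y[train_idx], seed)
            model = LogisticRegression(max_iter=1000, **params).fit(X_tr, y_tr)
            # The validation fold keeps its original, imbalanced distribution.
            fold_scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
        results.append((params, float(np.mean(fold_scores))))
    best_params, best_score = max(results, key=lambda r: r[1])
    return best_params, best_score, results


# Synthetic data with roughly 10% positives (an assumption for the demo).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
best_params, best_score, results = grid_search_with_fold_resampling(
    X, y, {"C": [0.01, 0.1, 1.0, 10.0]}
)
print(best_params, round(best_score, 3))
```

The same structure should carry over to undersampling: replace `upsample_minority` with the undersampler of your choice and keep the validation fold as it is. If imbalanced-learn is available, putting a sampler such as `NearMiss` as a step in `imblearn.pipeline.Pipeline` and passing that pipeline to `GridSearchCV` gives the same behaviour, because samplers in that pipeline are applied only when fitting.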
Do the grid search at the same level of imbalance that you plan to, or are able to, train and evaluate on.
So if you have seen that the imbalanced dataset does not skew your model's predictions or lead to other unwanted outcomes, then use the largest dataset possible. On the other hand, if your model is strongly overfitting because of the imbalance, then optimisation with grid search will make it overfit even more in that direction.
-
Thank you very much for your answer! So if my dataset is imbalanced, there is no point in using grid search on a resampled dataset, right? – Valentin Feb 16 '21 at 15:28