
I already referred to this post here, but there is no answer.

I am working on a binary classification problem using a random forest classifier. My dataset shape is (977, 8) with a 77:23 class proportion. My system has 4 cores and 8 logical processors.

As my dataset is imbalanced, I used BalancedBaggingClassifier (with a random forest as the base estimator).

Therefore, I used GridSearchCV to identify the best parameters of the BalancedBaggingClassifier, fit the model, and then predict.

My code looks like the below:

from sklearn.model_selection import StratifiedKFold, GridSearchCV

# rf_boruta is the BalancedBaggingClassifier(RandomForestClassifier(...)) defined earlier
# (its exact parameters are shown in the comments further down)
n_estimators = [100, 300, 500, 800, 1200]
max_samples = [5, 10, 25, 50, 100]
max_features = [1, 2, 5, 10, 13]
hyperbag = dict(n_estimators=n_estimators, max_samples=max_samples,
                max_features=max_features)
skf = StratifiedKFold(n_splits=10, shuffle=False)
gridbag = GridSearchCV(rf_boruta, hyperbag, cv=skf, scoring='f1', verbose=3, n_jobs=-1)
gridbag.fit(ord_train_t, y_train)

However, the logs generated in the Jupyter console contain messages where the GridSearchCV score is nan for some of the CV executions, as shown below.

You can see that for some of the CV executions the score is nan. Can anyone help me, please? It has also kept running for more than half an hour with no output yet.

Why does GridSearchCV return nan?

[CV 10/10] END max_features=1, max_samples=25, n_estimators=500;, score=nan total time= 4.5min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.596 total time=10.4min
[CV 5/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.622 total time=10.4min
[CV 6/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.456 total time=10.5min
[CV 9/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.519 total time=10.5min
[CV 5/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 3.3min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 9.9min
[CV 8/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 7.0min
[CV 6/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time=10.7min
[CV 1/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.652 total time=16.4min
[CV 9/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 7.6min
[CV 2/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.528 total time=16.6min
[CV 3/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.571 total time=16.4min
[CV 7/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.553 total time=16.1min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time= 6.7min
[CV 8/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time= 1.7min
[CV 10/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.489 total time=16.0min
[CV 3/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time=18.6min
[CV 1/10] END max_features=1, max_samples=50, n_estimators=100;, score=0.652 total time= 2.4min

Update - error trace report - reason the fit fails

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<timed exec> in <module>

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    889                 return results
    890 
--> 891             self._run_search(evaluate_candidates)
    892 
    893             # multimetric is determined here because in the case of a callable

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
   1390     def _run_search(self, evaluate_candidates):
   1391         """Search all candidates in param_grid"""
-> 1392         evaluate_candidates(ParameterGrid(self.param_grid))
   1393 
   1394 

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params, cv, more_results)
    836                     )
    837 
--> 838                 out = parallel(
    839                     delayed(_fit_and_score)(
    840                         clone(base_estimator),

~\AppData\Roaming\Python\Python39\site-packages\joblib\parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~\AppData\Roaming\Python\Python39\site-packages\joblib\parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

~\AppData\Roaming\Python\Python39\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~\Anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    443                     raise CancelledError()
    444                 elif self._state == FINISHED:
--> 445                     return self.__get_result()
    446                 else:
    447                     raise TimeoutError()

~\Anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    388         if self._exception:
    389             try:
--> 390                 raise self._exception
    391             finally:
    392                 # Break a reference cycle with the exception in self._exception

ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead
The Great
  • I already edited the other question multiple times, but it is closed and there is no response to either of the posts – The Great Mar 24 '22 at 14:18
  • Do you know why the fit fails in the first place? Your response in the other post says that nan is caused because the fit fails, but why does the fit fail? How can I avoid that? – The Great Mar 24 '22 at 14:21
  • Btw, one of the two posts that you linked was not posted by me, just FYI – The Great Mar 24 '22 at 14:22
  • My answer https://datascience.stackexchange.com/a/92225/55122 suggests a way to find out why the fits are failing. Try that and report the traceback. – Ben Reiniger Mar 24 '22 at 14:25
  • @BenReiniger - I tried the error trace report. There is a message that `ck_sampling_strategy raise ValueError( ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead """ ` but my `y_train` has two values, 0 and 1. I don't know why it says only one class is there – The Great Mar 24 '22 at 14:46
  • I pasted the full error message in the post. Does the balanced bagging classifier downsample and retain only one class or something? I do stratified k-fold though – The Great Mar 24 '22 at 14:50
  • still the same error.. – The Great Mar 24 '22 at 14:54
  • my estimator is written like this `rf_boruta = BalancedBaggingClassifier(RandomForestClassifier(class_weight='balanced_subsample',max_depth=5,max_features='sqrt',n_estimators=300))`. And I pass this `rf_boruta` as input to GridSearchCV. Am I doing this right? – The Great Mar 24 '22 at 14:56
  • Oh really? I thought they (used to?) skip joblib then. Oh well. Try using just a decision tree instead of random forest? Otherwise, I think you'll need to generate the splits manually and find one that gives the error to dig into (a rough sketch of this is shown after these comments). If you can provide a dataset that demonstrates the same issue, I can play around with it instead of going back and forth in the comments. – Ben Reiniger Mar 24 '22 at 14:56
  • Thanks for your help. I have no idea how I can generate/produce a new dataset that would help you reproduce this. I am new to data science and still learning lots of things – The Great Mar 24 '22 at 15:10
  • Is there any way to do a grid search type operation without GridSearchCV? – The Great Mar 24 '22 at 15:19
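
As a hypothetical aside, a rough sketch of the manual-splits check suggested above could look like the following. It reuses rf_boruta, ord_train_t and y_train from the question and one of the parameter combinations that produced nan in the log; it is a debugging aid, not a fix.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# Fit each stratified fold by hand and report which ones raise an error,
# using one of the combinations that scored nan in the log above.
skf = StratifiedKFold(n_splits=10, shuffle=False)
X = np.asarray(ord_train_t)
y = np.asarray(y_train)
est_params = dict(max_features=1, max_samples=25, n_estimators=100)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    est = clone(rf_boruta).set_params(**est_params)
    try:
        est.fit(X[train_idx], y[train_idx])
        print(f"fold {fold}: ok")
    except ValueError as exc:
        print(f"fold {fold}: FAILED -> {exc}")

The same loop, wrapped around sklearn.model_selection.ParameterGrid, is also one way to run a grid-search-style sweep without GridSearchCV.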

1 Answer


First I want to make sure you know what you're building here. You're doing (balanced) bagging with between 100 and 1200 estimators, each of which is a random forest of 300 trees. So each model builds between $100\cdot300=30k$ and $1200\cdot300=360k$ trees. Your grid search has $5^3=125$ hyperparameter combinations, and 10 folds. So you're fitting on the order of $10^8$ individual trees.
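
Sketching that arithmetic with the numbers from the grid above (just a back-of-the-envelope check, not part of the original code):

# Rough count of trees fitted by the grid search (numbers taken from the grid above).
n_estimators_grid = [100, 300, 500, 800, 1200]  # bagging members per model
trees_per_member = 300                          # each member is a 300-tree random forest
other_combos = 5 * 5                            # max_samples values x max_features values
folds = 10

total_trees = sum(n * trees_per_member * other_combos * folds for n in n_estimators_grid)
print(f"{total_trees:,}")  # 217,500,000 -> on the order of 10**8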

The grid search splits your data into 10 pieces, stratified so that the class balance should be the same as in the whole dataset. Now the balanced bagging is set to use only 25 rows, but it's also using the default "not minority" method, which means it tries to only downsample the majority class. Those two together are impossible, so I'm not really sure what ends up happening (if I have some time I'll look into that later). Since not all your scores are nan, it obviously sometimes works. But now the scarce 25 rows are used to train a random forest, so conceivably sometimes one of the trees there selects a bag with no examples from one of the classes. I suspect that's the issue.

A BalancedBaggingClassifier with a single decision tree as the base estimator acts as a fancier random forest, so that'd be my recommendation. You also wouldn't need to set class_weight in the tree, since the balanced bags will already be equally divided. I would expect better performance with larger max_samples, but even without changing that you'll now expect ~12.5 rows of each class for each tree to build off of. If you really want to balanced-bag random forests, then definitely increase the number of rows reaching each tree.
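
A minimal sketch of what that could look like is below; the specific grid values are illustrative assumptions, not tuned recommendations, and the float max_samples entries are fractions of the data passed to fit:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from imblearn.ensemble import BalancedBaggingClassifier

# Balanced bagging over single decision trees, i.e. roughly a "balanced random forest".
# No class_weight on the tree: each balanced bag is already ~50:50.
bal_rf = BalancedBaggingClassifier(
    DecisionTreeClassifier(max_depth=5, max_features='sqrt'),
)

# Illustrative grid only; float max_samples means a fraction of the training data,
# so each tree sees hundreds of rows instead of 25.
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_samples': [0.25, 0.5, 0.75],
    'max_features': [2, 5, 8],
}
skf = StratifiedKFold(n_splits=10, shuffle=False)
grid = GridSearchCV(bal_rf, param_grid, cv=skf, scoring='f1', verbose=3, n_jobs=-1)
# grid.fit(ord_train_t, y_train)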

Ben Reiniger
  • Thanks for the help. Upvoted. Let me read to absorb all the details. One quick question: I did hyperparameter tuning for a random forest to fit the model. When I predicted, the performance was low. So, I thought I would try BalancedBaggingClassifier. So now you suggest the process is correct, but that I should remove the useless class_weight parameter when I am using the balanced bagging classifier. – The Great Mar 24 '22 at 15:56
  • You also suggest that I should increase the max_samples of balanced bagging to improve performance? Should I increase it to like 300, 500, 700 etc? – The Great Mar 24 '22 at 15:59
  • Can you help me understand how you arrived at this number? It would be useful for me to learn - 'So each model builds between 15k and 360k trees' – The Great Mar 24 '22 at 16:00
  • Do you think it is even necessary, in my case, to do hyperparameter tuning for the balanced bagging classifier? – The Great Mar 24 '22 at 16:01
  • You went from a random forest of trees to a balanced-bagging _of random forests_, whereas I'm suggesting you go to a balanced random forest, i.e. `BalancedBaggingClassifier(DecisionTreeClassifier(...), ...)`. The random forest hyperparameters can largely be migrated to balanced-bagging hyperparameters. A side effect is that you can drop `class_weight`. And yes, increase max_samples to something where the trees can build something useful; the hundreds range may be fine, or use floats to specify a fraction of the entire dataset. HP-tuning is always a good idea. – Ben Reiniger Mar 24 '22 at 16:05
  • Thanks. By any chance, do you do freelance consulting on stats? If yes, I may be interested in seeking your help. I couldn't find your email ID in your profile, so I am asking here. – The Great Mar 24 '22 at 16:18