10

I have a very imbalanced dataset, with a ratio of positive to negative samples of 1:496. The scoring metric is the F1 score, and the model I want to use is LightGBM through its scikit-learn API (LGBMClassifier).

I have read the docs on the class_weight parameter in LightGBM:

class_weight : dict, 'balanced' or None, optional (default=None) Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

When I used the class_weight parameter on my dataset, which is a binary classification problem, I got a much better F1 score (0.7899) than when I used the recommended scale_pos_weight parameter (0.2388). Should I use class_weight or scale_pos_weight to balance the classes?
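A minimal sketch of the kind of comparison described above, assuming an LGBMClassifier and cross-validated F1 as the metric; the dataset here is synthetic and the class ratio is only illustrative, not the asker's actual 1:496 data:

    import numpy as np
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for an imbalanced binary problem (roughly 1:99 here).
    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.99, 0.01], random_state=0)
    n_neg, n_pos = np.bincount(y)

    # Reweighting via class_weight='balanced' ...
    clf_cw = LGBMClassifier(class_weight="balanced", random_state=0)

    # ... versus an explicit scale_pos_weight for the positive class.
    clf_spw = LGBMClassifier(scale_pos_weight=n_neg / n_pos, random_state=0)

    for name, clf in [("class_weight", clf_cw), ("scale_pos_weight", clf_spw)]:
        scores = cross_val_score(clf, X, y, scoring="f1", cv=5)
        print(name, scores.mean())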

Ethan
  • I haven't tried this myself, but the docs say to use 'is_unbalance', which I presume is the "class_weight" equivalent. Note that your probabilities will be haywire after this, and you will want to calibrate if you intend to use the probabilities. – ngopal Nov 26 '19 at 23:24
  • Cross-posted at https://stats.stackexchange.com/q/413596/232706 – Ben Reiniger Feb 24 '20 at 00:55
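Following up on ngopal's comment about calibrating the probabilities after reweighting, a minimal sketch of wrapping a reweighted LGBMClassifier in scikit-learn's CalibratedClassifierCV; the data and parameter values are illustrative assumptions:

    from lightgbm import LGBMClassifier
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.99, 0.01], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    # The reweighted model ranks well, but its raw probabilities are shifted
    # by the class weights, so calibrate before treating them as probabilities.
    base = LGBMClassifier(class_weight="balanced", random_state=0)
    calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
    calibrated.fit(X_train, y_train)
    proba = calibrated.predict_proba(X_test)[:, 1]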

1 Answer

13

For binary classification on an unbalanced dataset, you can achieve the same results with any of class_weight, scale_pos_weight, or is_unbalance.

Setting

class_weight = {0: 1.0, 
                1: (number of negative samples / number of positive samples)}

is the same as setting is_unbalance = True or scale_pos_weight = (number of negative samples / number of positive samples): in each case the positive class ends up weighted (number of negative samples / number of positive samples) times as heavily as the negative class.
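A minimal sketch of the three equivalent parametrisations, assuming the class counts are taken from the training labels; the dataset below is synthetic and only for illustration:

    import numpy as np
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.99, 0.01], random_state=0)
    n_neg, n_pos = np.bincount(y)

    # Each of these weights the positive class n_neg / n_pos times as
    # heavily as the negative class.
    clf_cw = LGBMClassifier(class_weight={0: 1.0, 1: n_neg / n_pos}, random_state=0)
    clf_ub = LGBMClassifier(is_unbalance=True, random_state=0)
    clf_spw = LGBMClassifier(scale_pos_weight=n_neg / n_pos, random_state=0)

    for clf in (clf_cw, clf_ub, clf_spw):
        clf.fit(X, y)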

Mayank Mahawar