
I am trying to model my data with Python and I am concerned about my binary target variable, because 90% of the cases fall in class 0 and only 10% in class 1. I tried upsampling my data and ended up with twice as many observations as before. I am not sure whether this is the right way to do it.

desertnaut
  • I like Frank Harrell’s (Vanderbilt stat professor and former chairman) approach to modeling such data: https://mobile.twitter.com/f2harrell/status/1062424969366462473. What problem are you having, something like a strong-looking accuracy of 89% that is actually worse than random guessing? – Dave Oct 30 '20 at 21:20
  • Does this answer your question? [Clustering with imbalanced data and groups](https://datascience.stackexchange.com/questions/76868/clustering-with-imbalanced-data-and-groups) – Pedro Henrique Monforte Oct 31 '20 at 03:00

1 Answer


There are a few options:

  1. Up-sample your data (as you described). Use SMOTE or a similar method to up-sample the minority class until you get closer to a 50/50 split between the positive and negative classes.

  2. If you have a lot of data, down-sample your more frequent class (at the cost of throwing away many examples from the negative class).

  3. Select performance metrics that are not skewed by class imbalance. The F1 score is the usual choice, but any metric that combines precision and recall should do the trick. Avoid accuracy as a scoring metric in this case. Selecting the right scoring metric also depends on the specifics of the business problem you are trying to address.

Oliver Foster
  • All of these are because of the use of improper scoring rules. Read what Frank Harrell has to say: https://mobile.twitter.com/f2harrell/status/1062424969366462473. Look at my post about using threshold-based scoring rules: https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email. Look at Kolassa’s post on the drawbacks of threshold-based scoring rules: https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/312787#312787. – Dave Oct 30 '20 at 23:27
  • @Dave Thanks for this - these resources are very interesting. I haven't heard this concern articulated in this way. +1. – Oliver Foster Oct 30 '20 at 23:42
  • These are the steps I am doing: – Martin Xristev Oct 31 '20 at 08:28
  • ```python
    from sklearn.utils import resample
    import pandas as pd

    # bank_data is my DataFrame; Personal_Loan is the binary target
    df_majority = bank_data[bank_data["Personal_Loan"] == 0]
    df_minority = bank_data[bank_data["Personal_Loan"] == 1]
    df_minority_upsampled = resample(df_minority, replace=True,
                                     n_samples=4468, random_state=123)
    df_upsampled = pd.concat([df_majority, df_minority_upsampled])
    df_upsampled["Personal_Loan"].value_counts()
    # 1    4468
    # 0    4468
    # Name: Personal_Loan, dtype: int64
    ```
    – Martin Xristev Oct 31 '20 at 08:31