I have a dataset of attributes describing machine breakdowns. The target variable is the machine status, encoded as ones and zeros. The distribution of the two classes is given below:

 0 - 19628
 1 - 225

0 signifies that the machine is running normally, and 1 signifies that there was a breakdown.

Should I split the dataset as it is using scikit-learn's train_test_split, or should I first introduce artificial rows to reduce the imbalance between ones and zeros and then split the dataset?

By artificial rows I mean populating some random data with the target variable set to 1. But that would ultimately mislead the model, and I don't see any other options or alternatives.

Is there a way to balance the samples?

Stephen Rauch
James K J
  • You can look at this post of mine: https://datascience.stackexchange.com/a/31385/51714 You could also consider upvoting it if you find it useful – ignatius Nov 08 '18 at 12:25
  • You can solve the problem of imbalanced data by oversampling and undersampling – Noran Nov 08 '18 at 12:41
  • Is yours a classification problem? If your distribution represents the true distribution then you should not artificially try to change the number of samples. – Anshul G. Nov 08 '18 at 13:02
  • @AnshulG. Yes, it is a classification problem. – James K J Nov 08 '18 at 14:07
  • Then you should consider the true distribution. Artificially increasing the samples in the minority class would make the model believe that getting the samples right in that class is as important as getting them right for the majority class, and could thus reduce your overall accuracy. On the other hand, if getting the predictions right for the minority class is more important, then you may want to consider increasing your samples or going for a Bayesian approach. – Anshul G. Nov 08 '18 at 18:34

1 Answer

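SMOTE (Synthetic Minority Over-sampling Technique) is a resampling technique popularly used with unbalanced datasets like yours. It is available in Python through the imbalanced-learn library, and it creates synthetic minority-class samples to produce a new, balanced dataset. You can find some example implementation here.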
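To give a feel for what SMOTE does, here is a minimal pure-Python sketch of its core idea: each synthetic row is placed at a random point on the line between a minority row and one of its nearest minority neighbours. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_sketch` and the toy data are made up for the example, and it assumes all features are numeric.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority neighbours.

    Hypothetical helper for illustration only; assumes numeric features.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding the row itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Made-up toy data: four breakdown rows with two numeric features each.
breakdowns = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_rows = smote_sketch(breakdowns, n_new=6, k=2)
```

Unlike rows filled with random values, the synthetic rows stay inside the region already occupied by real breakdown rows. In practice you would more likely use the `SMOTE` class from `imblearn.over_sampling` and call its `fit_resample(X, y)` method than write this yourself.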

You may also find this question and its answers relevant, as they contain advice on limiting bias when modeling and validating a classifier on unbalanced data.
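On the order of operations asked about in the question, a common way to limit bias is to split first, keeping the class ratio the same in both parts, and then rebalance only the training part, so the test set still reflects the real distribution. The sketch below illustrates that order in plain Python using the class counts from the question. The helper names `stratified_split` and `oversample_minority` are hypothetical, and simple random duplication stands in for whichever resampling method you choose.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def oversample_minority(train_idx, labels, seed=0):
    """Duplicate random minority rows until both classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels[i] for i in train_idx)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    pool = [i for i in train_idx if labels[i] == minority]
    extra = rng.choices(pool, k=counts[majority] - counts[minority])
    return train_idx + extra


# Class counts taken from the question: 19628 zeros and 225 ones.
labels = [0] * 19628 + [1] * 225
train_idx, test_idx = stratified_split(labels)
balanced_train = oversample_minority(train_idx, labels)
```

Because the duplicated rows are drawn only from `train_idx`, no copy of a test row can leak into training. With scikit-learn, the split step corresponds to passing `stratify=y` to `train_test_split`.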

missrg