
I have an imbalanced dataset and want to predict a binary class (0 or 1).

Sample x_train:

id      date    c1   c2 . . . . . .  c20
101  13-02-2015  2    7 . . . . . .   14
101  14-02-2015 24    7 . . . . . .    8
  .
  .
  .
105  13-02-2015 12    5 . . . . . .    4
  .
  .

Sample y_train:

id   class
101    1
105    1
107    0
 .
 .
 .

Now I want to oversample class 0 in the dataset, but the problem is that each id has just one row in y_train, whereas the same id has 50 rows in x_train.

sophros
yamini goel

1 Answer


What you have here is called multiple-instance learning (MIL). From Wikipedia:

In machine learning, multiple-instance learning (MIL) is a type of supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances.

Source: https://en.wikipedia.org/wiki/Multiple_instance_learning

In your data, each id is a bag and its 50 rows in x_train are the instances, so the approach you take is different. You need to reduce the multi-instance learning problem to a single-instance learning one. One way you can do this is:

  1. Perform a K-Means clustering for each bag of instances
  2. Calculate the Hausdorff distance of each instance from each of the clusters
  3. Use those distances as features and keep the label from the y_train set
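The steps above can be read in more than one way. The sketch below is only one possible interpretation, not necessarily the exact method from the review: it pools the instances of all bags, runs a plain K-Means on them, and then describes each bag by its Hausdorff distance to each centroid (treated as a one-point set), which gives one feature row per id. The names `hausdorff`, `kmeans`, `embed_bags` and the toy data are hypothetical, and the instances are assumed to hold only the numeric columns (c1 ... c20), with the date column dropped or encoded beforehand.

```python
import math
import random


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(u, v):
        # Largest distance from a point in u to its nearest point in v.
        return max(min(math.dist(p, q) for q in v) for p in u)
    return max(directed(a, b), directed(b, a))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's K-Means on a list of points; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids


def embed_bags(bags, k=2, seed=0):
    """Map {bag_id: [instance, ...]} to {bag_id: [k distances]}."""
    pooled = [inst for insts in bags.values() for inst in insts]
    centroids = kmeans(pooled, k, seed=seed)
    return {
        bag_id: [hausdorff(insts, [c]) for c in centroids]
        for bag_id, insts in bags.items()
    }


# Toy bags keyed by id (assumption: two numeric columns instead of twenty).
bags = {
    101: [[2.0, 7.0], [24.0, 7.0]],
    105: [[12.0, 5.0], [11.0, 6.0]],
    107: [[3.0, 8.0], [4.0, 9.0]],
}
features = embed_bags(bags, k=2)
for bag_id, row in features.items():
    print(bag_id, row)
```

Each id now has a single fixed-length feature row that can be joined with its label from y_train.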

Then apply SMOTE to the new dataset (which has one row of features and one label per id) and train any single-instance learning model on it.
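To make the oversampling step concrete, here is a minimal sketch of the interpolation idea behind SMOTE, using only the standard library. It is a simplification for illustration, not the reference algorithm; in practice you would more likely use a library implementation such as `SMOTE` from imbalanced-learn and its `fit_resample` method. The function `smote_like` and the feature values are hypothetical, and class 0 is taken as the class to oversample, as in the question.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen row (excluding itself).
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # Position along the segment from base to other.
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical bag-level features (one row per id) with labels from y_train.
X = [[1.0, 9.0], [1.2, 8.5], [0.8, 9.4], [1.1, 9.1],
     [7.0, 2.0], [7.5, 2.4], [6.8, 1.9]]
y = [1, 1, 1, 1, 0, 0, 0]

minority = [row for row, label in zip(X, y) if label == 0]
n_needed = y.count(1) - y.count(0)  # Rows required to balance the classes.
new_rows = smote_like(minority, n_needed)

X_balanced = X + new_rows
y_balanced = y + [0] * len(new_rows)
print(len(X_balanced), y_balanced.count(0), y_balanced.count(1))
```

Because the synthetic rows are created at the bag level, every new row already comes with exactly one label, which sidesteps the mismatch between 50 rows in x_train and one row in y_train.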

You can find details in this review: "Review of Multi-Instance Learning and Its applications".

Tasos