SMOTE for multi-instance learning, i.e. num_rows(x_train) > num_rows(y_train)

Question

I have an imbalanced dataset and I want to predict a binary class (0 or 1).

Sample x_train:

id   date        c1  c2  . . .  c20
101  13-02-2015   2   7  . . .   14
101  14-02-2015  24   7  . . .    8
  .
  .
  .
105  13-02-2015  12   5  . . .    4
  .
  .

Sample y_train

id   class
101    1
105    1
107    0
 .
 .
 .

Now I want to oversample class 0 in the dataset. The problem is that each id has just one row in y_train but 50 rows in x_train.

Tasos · Answer 1 · 2019-08-17T09:40:40.763

What you have here is called Multi-Instance Learning. From Wikipedia:

In machine learning, multiple-instance learning (MIL) is a type of supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances.

Source: https://en.wikipedia.org/wiki/Multiple_instance_learning

The approach in this case is different: you need to reduce the Multi-Instance Learning problem to a Single-Instance Learning one. One way to do this is:

1. For each bag of instances (all rows sharing an id), perform a K-Means clustering.
2. Calculate the Hausdorff distance of each instance from each of the clusters.
3. Use those distances as features and keep the label from the y_train set.
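The steps above can be read in more than one way, so here is only a minimal sketch of one possible reading, not a definitive implementation. It departs from the first step in one respect: it clusters all instances together rather than each bag separately, so that every bag ends up described by the same k columns. The helper names (`kmeans`, `hausdorff`, `bags_to_rows`), the choice k=3 and the random data are all assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on the pooled instances; returns k centroids."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (unchanged if empty).
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets (rows are points)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def bags_to_rows(x, ids, k=3):
    """Turn instance-level rows into one row of k distance features per id."""
    centroids = kmeans(x, k)
    bag_ids = np.unique(ids)
    rows = np.array([
        [hausdorff(x[ids == b], c[None, :]) for c in centroids]
        for b in bag_ids
    ])
    return bag_ids, rows

# Made-up data shaped like the question: 50 rows per id, 20 feature columns.
rng = np.random.default_rng(0)
ids = np.repeat([101, 105, 107], 50)
x = rng.normal(size=(150, 20))          # stand-in for columns c1..c20

bag_ids, bag_features = bags_to_rows(x, ids)
print(bag_features.shape)               # (3, 3): one row per id, one column per cluster
```

The rows of `bag_features` are in the order of `bag_ids`, so they can be joined to y_train on id to get one feature row and one label per id.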

Then apply SMOTE to the new dataset (which now has one row of features and one label per id), followed by any Single-Instance Learning model.
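To make the oversampling step concrete, here is a hand-rolled sketch of the core SMOTE idea: each synthetic row is an interpolation between a minority row and one of its nearest minority neighbours. It is a simplified illustration, not the full algorithm; in practice you would normally use a library implementation such as `SMOTE` from imbalanced-learn. The function name `smote_sketch`, the bag-level feature matrix and the class counts are assumptions made up for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)          # needs at least two minority rows
    # Pairwise distances among minority rows; a row is not its own neighbour.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    # For each synthetic row: pick a base row, then one of its k neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new row at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical bag-level table: one feature row and one label per id.
rng = np.random.default_rng(1)
X_bag = rng.normal(size=(40, 3))
y_bag = np.array([1] * 32 + [0] * 8)    # class 0 is the minority, as in the question

X_min = X_bag[y_bag == 0]
n_new = int((y_bag == 1).sum() - (y_bag == 0).sum())
X_syn = smote_sketch(X_min, n_new)

X_res = np.vstack([X_bag, X_syn])
y_res = np.concatenate([y_bag, np.zeros(n_new, dtype=int)])
print(np.bincount(y_res))               # [32 32]
```

Only the training split should be oversampled; the validation and test ids should keep their original class balance.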

You can find details in the paper "Review of Multi-Instance Learning and Its Applications".
