I have a dataset D:

```python
X = D.drop(columns=['target'])
y = D['target']
```
D is large but contains a huge number of duplicates, and I want to speed up the learning process. I can't simply drop the duplicates, because that would bias the distribution. But I can do the following (a pandas sketch follows the list):
- Compute the duplicate counts and save them separately as SW.
- Drop the duplicates from the dataset and save the result as D1.
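For concreteness, here is a minimal pandas sketch of those two steps (just one way to do it; `size` is the count column that `groupby(...).size()` produces):

```python
# Count occurrences of every unique (features + target) row.
# dropna=False keeps rows containing missing values grouped as well.
counts = D.groupby(list(D.columns), as_index=False, dropna=False).size()

D1 = counts.drop(columns=['size'])  # the deduplicated dataset
SW = counts['size'].to_numpy()      # duplicate counts, row-aligned with D1
```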
Let

```python
X1 = D1.drop(columns=['target'])
y1 = D1['target']
```
Now I can estimate a metric on a pre-trained model like this:

```python
score(y1, model.predict(X1), sample_weight=SW)
```

and this will be exactly equal to (I verified this for roc_auc_score):

```python
score(y, model.predict(X))
```
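The check I ran looks roughly like this (`model` is any already-fitted classifier):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Weighted score on the deduplicated data should match the
# unweighted score on the full data.
assert np.isclose(
    roc_auc_score(y1, model.predict(X1), sample_weight=SW),
    roc_auc_score(y, model.predict(X)),
)
```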
Similarly, I can pass SW as the sample_weight parameter to model.fit(...). model.fit(X, y) will not be exactly equivalent to model.fit(X1, y1, sample_weight=SW), but I can live with that.
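For example (LogisticRegression is just a placeholder here; most scikit-learn estimators accept sample_weight in fit):

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
# Each unique row is weighted by its duplicate count.
model.fit(X1, y1, sample_weight=SW)
```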
The problem is: how can I properly perform a train_test_split of X1, y1, and SW? For example:
```
X        y
feature  1
feature  1
feature  1
feature  1
feature  1
```
after duplicate removal it will be:
```
X1       y1  SW
feature  1   5
```
after train_test_split(test_size=0.2) I expect the counts themselves to be split, with 20% of the total weight going to the test set:
```
X1_train  y1_train  SW_train
feature   1         4
```

```
X1_test  y1_test  SW_test
feature  1        1
```
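To make the expected behaviour precise, here is a sketch of what I imagine such a split doing. The name `weighted_train_test_split` is my own, not an existing sklearn function; it divides each row's count between train and test by sampling without replacement, which is equivalent in distribution to running train_test_split on the original duplicated rows:

```python
import numpy as np

def weighted_train_test_split(X1, y1, SW, test_size=0.2, random_state=None):
    """Split the duplicate counts themselves (my own sketch, not sklearn API)."""
    SW = np.asarray(SW, dtype=np.int64)
    rng = np.random.default_rng(random_state)
    # The test set receives test_size of the total weight, drawn without
    # replacement across rows (multivariate hypergeometric over the counts).
    n_test = int(round(SW.sum() * test_size))
    sw_test = rng.multivariate_hypergeometric(SW, n_test)
    sw_train = SW - sw_test
    # Keep only rows that actually received copies on each side.
    train_mask, test_mask = sw_train > 0, sw_test > 0
    return (X1[train_mask], X1[test_mask],
            y1[train_mask], y1[test_mask],
            sw_train[train_mask], sw_test[test_mask])
```

On the single-row example above this yields SW_train = [4] and SW_test = [1], as expected. The obvious drawback is that it supports neither stratification nor grouping, which sklearn's train_test_split does; hence my question about library support.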
If this is not possible in scikit-learn, is there an ML library that provides this functionality?