I have a dataset D:

```python
X = D.drop(columns=['target'])
y = D['target']
```
D is large but contains a huge number of duplicates, and I want to speed up the learning process. I can't simply drop the duplicates, because that would bias the distribution. But I can do the following (a pandas sketch follows the list):
- Compute the duplicate counts and save them separately as SW.
- Drop the duplicates from the dataset and save the result as D1.
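For concreteness, here is a minimal pandas sketch of those two steps (just one way to do it; `size` is the count column that `groupby(...).size()` produces):

```python
# Count occurrences of every unique (features + target) row.
# dropna=False keeps rows containing missing values grouped as well.
counts = D.groupby(list(D.columns), as_index=False, dropna=False).size()

D1 = counts.drop(columns=['size'])  # the deduplicated dataset
SW = counts['size'].to_numpy()      # duplicate counts, row-aligned with D1
```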
Let

```python
X1 = D1.drop(columns=['target'])
y1 = D1['target']
```
Now I can estimate a metric on a pre-trained model like this:

```python
score(y1, model.predict(X1), sample_weight=SW)
```

and this will be exactly equal to (I verified this for roc_auc_score):

```python
score(y, model.predict(X))
```
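The check I ran looks roughly like this (`model` is any already-fitted classifier):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Weighted score on the deduplicated data should match the
# unweighted score on the full data.
assert np.isclose(
    roc_auc_score(y1, model.predict(X1), sample_weight=SW),
    roc_auc_score(y, model.predict(X)),
)
```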
Similarly, I can pass SW as the sample_weight parameter to model.fit(...). model.fit(X, y) will not be exactly equivalent to model.fit(X1, y1, sample_weight=SW), but I can live with that.
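For example (LogisticRegression is just a placeholder here; most scikit-learn estimators accept sample_weight in fit):

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
# Each unique row is weighted by its duplicate count.
model.fit(X1, y1, sample_weight=SW)
```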
The problem is: how can I properly perform a train_test_split of X1, y1, and SW? For example:
```
X        y
feature  1
feature  1
feature  1
feature  1
feature  1
```
after duplicate removal it will be:
```
X1       y1  SW
feature  1   5
```
after train_test_split(test_size=0.2) I expect the counts themselves to be split, with 20% of the total weight going to the test set:
```
X1_train  y1_train  SW_train
feature   1         4
```

```
X1_test  y1_test  SW_test
feature  1        1
```
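To make the expected behaviour precise, here is a sketch of what I imagine such a split doing. The name `weighted_train_test_split` is my own, not an existing sklearn function; it divides each row's count between train and test by sampling without replacement, which is equivalent in distribution to running train_test_split on the original duplicated rows:

```python
import numpy as np

def weighted_train_test_split(X1, y1, SW, test_size=0.2, random_state=None):
    """Split the duplicate counts themselves (my own sketch, not sklearn API)."""
    SW = np.asarray(SW, dtype=np.int64)
    rng = np.random.default_rng(random_state)
    # The test set receives test_size of the total weight, drawn without
    # replacement across rows (multivariate hypergeometric over the counts).
    n_test = int(round(SW.sum() * test_size))
    sw_test = rng.multivariate_hypergeometric(SW, n_test)
    sw_train = SW - sw_test
    # Keep only rows that actually received copies on each side.
    train_mask, test_mask = sw_train > 0, sw_test > 0
    return (X1[train_mask], X1[test_mask],
            y1[train_mask], y1[test_mask],
            sw_train[train_mask], sw_test[test_mask])
```

On the single-row example above this yields SW_train = [4] and SW_test = [1], as expected. The obvious drawback is that it supports neither stratification nor grouping, which sklearn's train_test_split does; hence my question about library support.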
If this is not possible in scikit-learn, is there an ML library that provides this functionality?