
I am currently working on a text classification project. The specific problem is classifying job titles into industry codes. For example, "McDonalds Employee" might get classified to 11203 (there are a few hundred classes in the problem). For this we are using FastText.

The person I am working with insists on removing duplicate records from the data before training our model. That is, we might see 100 records with "McDonalds Employee" and class 11203, and he wants to remove all but one of them. His argument is that not doing so could result in over-fitting and an optimistic error rate, since the same records will appear in the train, test, and validation sets. My counter is that I expect to see (many) records with "McDonalds Employee" in our future data, and I want to know how the model will do at predicting them, so we would arrive not at an optimistic error rate but a properly calculated one. Secondly, if our data for some reason has a single record "McDonalds Employee" with class 24444, deduplication leaves that record and one 11203 record with equal weight, destroying the evidence that 11203 is by far the more common code.
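To make the second point concrete, here is a minimal sketch with made-up toy data (the counts are hypothetical, chosen to mirror the example above). Deduplicating on (text, label) pairs collapses the label frequencies, so the mislabeled record ends up carrying the same weight as the common label:

```python
from collections import Counter

# Hypothetical toy data: 100 records labeled 11203 and one stray 24444.
records = [("McDonalds Employee", 11203)] * 100 + [("McDonalds Employee", 24444)]

# Label frequencies before deduplication: the model can learn that
# 11203 is overwhelmingly the more likely class for this title.
before = Counter(label for _, label in records)

# Deduplicating on the (text, label) pair keeps one copy of each,
# so the two labels become equally frequent and the signal is lost.
deduped = list(dict.fromkeys(records))
after = Counter(label for _, label in deduped)

print(before)  # Counter({11203: 100, 24444: 1})
print(after)   # Counter({11203: 1, 24444: 1})
```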

I have read other posts here suggesting that removing duplicates is not correct, but I have yet to see an actual source in the literature stating this. Since I have to convince a colleague, my question is twofold: does anyone know of a reference in the literature that supports keeping duplicates? And is there any reason to remove duplicates specific to FastText? I admit I am not that familiar with NLP and FastText (or even neural networks in general), so perhaps there is some reason to remove them when training a model of this type.
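For what it's worth, one compromise that is sometimes used (not something the post proposes, just an illustration with made-up data) is to split by unique title rather than by record: duplicates are kept, so label frequencies survive, but identical rows can never straddle the train/test boundary:

```python
import random

# Hypothetical toy dataset: duplicated (title, code) records, as in the question.
records = [("McDonalds Employee", 11203)] * 100 \
        + [("McDonalds Employee", 24444)] \
        + [("Software Developer", 54321)] * 20

# Shuffle the unique titles and assign each whole title to one side,
# so every duplicate of a given title lands in the same split.
titles = sorted({text for text, _ in records})
random.seed(0)
random.shuffle(titles)
cut = max(1, int(0.8 * len(titles)))
train_titles = set(titles[:cut])

train = [r for r in records if r[0] in train_titles]
test = [r for r in records if r[0] not in train_titles]

# Duplicates are preserved inside each split, so label frequencies survive,
# yet no record appears on both sides of the split.
assert not ({t for t, _ in train} & {t for t, _ in test})
```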

  • generally, the test set should not include the same information as in the train set. So duplicates in both sets are a problem IMO. Have a look here: https://stats.stackexchange.com/questions/20010/how-can-i-help-ensure-testing-data-does-not-leak-into-training-data – Peter May 14 '20 at 20:08
  • Can you explain why? Particularly in light of what I said about my expectations that these duplicate records will also show up in my future data – astel May 14 '20 at 21:31
  • also posted at https://stats.stackexchange.com/q/466526/232706 – Ben Reiniger May 14 '20 at 21:33
  • The test set serves the purpose of checking if the model is able to detect cases „never seen before“. This is what you want to achieve after all. So there is no point in having identical „rows“ in both sets. This should be avoided. – Peter May 14 '20 at 22:01
  • Please don‘t cross post on several SE forums. This usually is a reason to close the question. Cheers – Peter May 14 '20 at 22:02
  • Why do you want to test on data never seen before? So you can understand how your model generalizes to future data. Well if your future data contains those cases then what is the issue. You are not answering my question and your link does not address it either. – astel May 14 '20 at 22:16
