Could this just be a bad draw on the 20%, or is it overfitting? I'd appreciate some tips on what's going on.
- What do you mean by "I removed some random substrings"? Also note that you merged the validation and test sets and evaluated your model on them; this means you can't reuse the same test set to estimate the generalization error (since you already used it). You also didn't tell us what the task is, what model you tried, or what parameters you used – Gius May 15 '22 at 17:35
1 Answer
A few comments:
- You don't mention number of classes or distribution. Unless the classes are balanced, you should use precision/recall/f1-score instead of accuracy (if your majority class is 75%, accuracy can be 75% just by always predicting this class).
- It's also unclear what your validation set is used for.
- When your feature is represented as a bag of words, it's not one feature anymore; it's as many features as the vocabulary size. This matters because if the vocabulary is very large, you're very likely to overfit. This is almost certainly why performance improves when you remove some words.
- Generally you should remove all the rare words, which are useless for the model and often cause overfitting.
- A drop from 78% on the validation set to 75% on the test set is not necessarily worrying, but that depends on other factors.
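To illustrate the first bullet, here is a minimal sketch of why accuracy misleads on imbalanced classes. The labels are synthetic (an assumed 75% majority class, matching the example in the bullet), not the asker's data:

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 75% of instances belong to class "a".
y_true = ["a"] * 75 + ["b"] * 15 + ["c"] * 10
# A degenerate "model" that always predicts the majority class.
y_pred = ["a"] * 100

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(acc)       # 0.75 despite the model having learned nothing
print(macro_f1)  # much lower, exposing the two ignored classes
```

Macro-averaged F1 scores each class equally, so the zero recall on classes "b" and "c" drags the score down even though overall accuracy looks acceptable.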
Erwan
- You can definitely retrain on the whole data once you're satisfied with the model performance; of course this means you cannot evaluate the model anymore after that. It can improve performance a little, but it's unlikely to improve it a lot (if it does, it means there's some serious overfitting happening, and that's not good). – Erwan May 15 '22 at 18:12
- @AndreiJarca how many features do you have after encoding? And is it binary classification? – Erwan May 15 '22 at 18:13
- I didn't check the number of features afterwards, but the whole vocabulary is over 1M, so probably a lot of features. There are 3 classes, so multiclass – May 15 '22 at 18:16
- @AndreiJarca then it's almost certain that you have overfitting, because you have way more features than instances. My guess is that one of the classes is large, around 75%, and the model does nothing but predict this class. You should remove everything in your vocab which occurs fewer than N times, with N=2 at the very least. – Erwan May 15 '22 at 18:31
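The N=2 pruning suggested in the last comment can be done directly with scikit-learn's `min_df` parameter. The toy corpus below is made up (an assumption for illustration); on real data with a ~1M-word vocabulary, this kind of cutoff typically removes the bulk of the features:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up three-document corpus for illustration only.
corpus = [
    "the model overfits on rare words",
    "rare words hurt the model",
    "typo-like singletons appear once xqzt",
]

full = CountVectorizer().fit(corpus)
# min_df=2 keeps only tokens that appear in at least 2 documents,
# discarding singletons (typos, rare words, noise).
pruned = CountVectorizer(min_df=2).fit(corpus)

print(len(full.vocabulary_))    # every distinct token
print(len(pruned.vocabulary_))  # only tokens seen in >= 2 documents
```

Note that `min_df` counts *document* frequency, not total occurrences, which is usually what you want for removing one-off noise tokens.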