
Among my colleagues I have noticed a curious insistence on training with, say, 70% or 80% of the data and validating on the remainder. What makes it curious to me is the lack of any theoretical justification; it smacks of a habit carried over from five-fold cross-validation.

Is there any reason to choose a larger training set when the goal is to detect overfitting during training? In other words, why not use $n^{0.75}$ examples for training and the remaining $n - n^{0.75}$ for validation, if the 70/80% convention really is carried over from cross-validation practices in linear modeling theory, as I suggest in this answer?
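For concreteness, here is a minimal sketch (plain Python, illustrative dataset sizes only, not part of any library API) that just computes the split sizes under the conventional 80/20 rule versus the $n^{0.75}$ rule I am asking about:

```python
# Illustrative only: compare a fixed 80/20 split with the n**0.75 rule
# for a few dataset sizes.
for n in (1_000, 100_000, 10_000_000):
    train_fixed = int(0.8 * n)            # conventional 80% training split
    train_power = int(round(n ** 0.75))   # n**0.75 examples for training
    val_power = n - train_power           # remainder used for validation
    print(f"n={n:>10,}  80/20 train={train_fixed:>10,}  "
          f"n^0.75 train={train_power:>9,}  n^0.75 val={val_power:>10,}")
```

Note that under the $n^{0.75}$ rule the training set quickly becomes much smaller than the validation set as $n$ grows, which is exactly the opposite of the 70/80% convention.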

I posted a similar question on stats.stackexchange.com, but based on the response there I thought I might get a more interesting discussion here. The concept of training for multiple epochs is, in my opinion, inherently Bayesian, so cross-validation may be ill-suited at worst and unnecessary at best, for reasons I suggest in that post.

brethvoice

1 Answer


The usual reasoning is: "the more data for training, the better." The theory is that the larger the training set, the better the model should generalize. At the same time, keep in mind that the validation/hold-out set has to resemble what the model will face in production/testing.

The validation set can be much smaller; on an extremely big dataset you can make it as little as 0.01% of the data and there should be no problem.
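As an illustration of how small the hold-out can be, here is a minimal sketch assuming scikit-learn and purely synthetic data (the dataset size and the 0.01% fraction are the only numbers that matter here):

```python
# Minimal sketch: a deliberately tiny hold-out set on a large synthetic dataset.
# With one million rows, 0.01% still leaves ~100 validation examples; on a
# truly huge dataset the same fraction yields thousands.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1_000_000, 5))            # synthetic features
y = rng.integers(0, 2, size=1_000_000)    # synthetic binary labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.0001, random_state=0, stratify=y
)
print(len(X_train), len(X_val))           # 999,900 train / 100 validation
```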

In basic cases you don't even need K-fold cross-validation: it makes training more expensive, and it is really only needed for hyperparameter search, where it should be run inside the training set.
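A minimal sketch of that workflow, assuming scikit-learn (the model, the parameter grid, and the synthetic data are all placeholders): K-fold runs only inside the training portion for hyperparameter search, and the hold-out set is scored once at the end.

```python
# Sketch: 5-fold CV for hyperparameter search inside the training set,
# then a single evaluation on the held-out validation set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

search = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,                                      # K-fold only on X_train
)
search.fit(X_train, y_train)

print("best C:", search.best_params_["C"])
print("hold-out accuracy:", search.score(X_val, y_val))
```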

For your case, you can use whatever split you want. Just keep the training set as large as possible, and make the validation set resemble the production environment as closely as possible.

Carlos Mougan
  • @user86339 but at deployment time (after model and hyper-parameter selection) would you not use *all* available data to train the model, validating on a pre-determined held-out "test" set which has never been used for model or hyper-parameter selection before? – brethvoice Dec 29 '20 at 18:35
  • Would you? Another discussion is whether the hyperparameters tuned on the training data will be the same as for the whole data. There is another question on this forum about this. – Carlos Mougan Dec 29 '20 at 20:27
  • @user86339 please link to the other question you mention here in comments to this one. – brethvoice Dec 29 '20 at 21:59
  • https://datascience.stackexchange.com/questions/64767/new-parameters-in-final-training – Carlos Mougan Dec 30 '20 at 14:28
  • "The more training data, the better" is not actually true, unless one is attempting to overfit. – brethvoice Jun 23 '21 at 15:31
  • @brethvoice I don't agree with your claim. I don't see the relation between more data and overfitting. – Carlos Mougan Jul 12 '21 at 07:30
  • @user86339 maybe we can compromise and agree that for hyper-parameter tuning it makes sense to use a validation data set larger than (any of) the one(s) used for training? I seriously hope nobody out there is actually using n-fold cross-validation for hyper-parameter search... – brethvoice Jul 12 '21 at 12:40