
Coming from statistics, I'm just starting to learn machine learning. I've read a lot of tutorials about ML, but have no formal training.

I'm working on a little project where my dataset has 6k rows and around 300 features.

As I've read in my tutorials, I split my dataset into a training sample (80%) and a testing sample (20%), and then train my algorithm on the training sample with cross-validation (5 folds).

When I ran my program twice (I've only tested KNN, which I now know is not really appropriate), I got really different results, with different sensitivity, specificity and precision.

I guess that if I re-run the program until the metrics are good, my algorithm will be overfitted, and I also guess this is because of the re-sampling of the training/test split, but please correct me if I'm wrong.

If I'm going to try a lot of algorithms to see what I can get, should I fix my samples somewhere? Is it even OK to try several algorithms? (it wouldn't always be in statistics)

In case it matters, I'm working with python's scikit-learn module.

PS: my outcome is binary and my features are mostly binary, with a few categorical and a few numeric ones. I'm thinking about logistic regression, but which algorithm would be best?


1 Answer


I guess that if I re-run the program until the metrics are good, my algorithm will be overfitted

Re-running an algorithm does not, by itself, cause over-fitting; you can re-run and select the best model without a problem. However, if we want to compare two algorithms, we must average over a large enough number of re-runs. In practice, though, we usually want the best model, not the best algorithm.

I also guess this is because of the re-sampling of the training/test split, but please correct me if I'm wrong.

Re-sampling the training set only contributes to what is known as model variance, which denotes the fact that different training samples yield different models.
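
You can see this directly by re-splitting with different seeds. A minimal sketch (here `X` and `y` only stand for your feature matrix and binary 0/1 outcome; this is not code from your project):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score

# X, y: placeholders for your ~6k-row feature matrix and binary (0/1) outcome
for seed in (0, 1):  # two "re-runs" of the program, each with a different split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    pred = KNeighborsClassifier().fit(X_tr, y_tr).predict(X_te)
    print(f"run {seed}: sensitivity={recall_score(y_te, pred):.3f}, "
          f"precision={precision_score(y_te, pred):.3f}")
```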

When I ran my program twice (I've only tested KNN, which I now know is not really appropriate), I got really different results

Your observation is natural. A general approach to decreasing the variance of KNN is to increase the parameter $K$: a higher $K$ means KNN looks at more points around each query point (see these plots).
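
As a rough sketch of the effect (same placeholder `X`, `y` as above), you can check that the spread of the test score across random splits shrinks as $K$ grows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# spread of test accuracy over 10 random splits, for a very small vs a larger K
for k in (1, 25):
    scores = []
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        scores.append(KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te))
    print(f"K={k}: accuracy {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```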

If I'm going to try a lot of algorithms to see what I can get, should I fix my samples somewhere?

Random sampling is OK.

In python, sklearn.model_selection.cross_validate builds and validates K models for a given algorithm and returns K results; assuming we feed 80% of the data to K-fold CV, it splits that 80% into an 80(K-1)/K% training set and an 80/K% validation set, K times.
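
For example (a sketch, where `X_train`, `y_train` denote the 80% portion from your initial split):

```python
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# 5-fold CV on the 80% training portion: each fold trains on 64% of the full
# data (80% * 4/5) and validates on 16% (80% * 1/5)
cv_results = cross_validate(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
print(cv_results["test_score"])  # 5 validation scores, one per fold
```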

In summary: first split the data 80%-20%; then, for each algorithm (e.g. 1-NN, 2-NN, SVM, etc.), run cross-validation on the 80%; next, select the model with the best [validation] accuracy (set return_estimator = True to get the K models per algorithm from K-fold CV, so for 3 algorithms we are selecting among 3K models); and finally test the best model on the held-out 20% to get the test accuracy. Cross-validation has no meaning at this last step, since the best model is already built; the final result is the test accuracy.
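
Putting it together, a sketch of the whole procedure (the candidate algorithms below are only examples; swap in your own shortlist):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# fixed random_state so the held-out 20% test set stays the same across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "logistic": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

best_score, best_model, best_name = -np.inf, None, None
for name, algo in candidates.items():
    cv = cross_validate(algo, X_train, y_train, cv=5, return_estimator=True)
    i = int(np.argmax(cv["test_score"]))  # best of the 5 fold-models for this algorithm
    if cv["test_score"][i] > best_score:
        best_score, best_model, best_name = cv["test_score"][i], cv["estimator"][i], name

# final, one-shot evaluation of the selected model on the untouched 20%
print(f"selected {best_name}; test accuracy:", best_model.score(X_test, y_test))
```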

Also, take a look at this answer on train, validation, test sets.

Is it even OK to try several algorithms? (it wouldn't always be in statistics)

Yes. Always try various algorithms.

A side note on parameter finding

We can apply the above procedure to (1-NN, 2-NN, 3-NN), and if the winning model is always selected from the 3-NN algorithm, then we can limit our experiments to (3-NN, SVM) instead of all four (1-NN, 2-NN, 3-NN, SVM). Otherwise, if the winner does not consistently come from 3-NN, we keep experimenting with all four.
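
One possible way to check that consistency (again just a sketch, with `X_train`, `y_train` as above):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# count how often each K wins 5-fold CV across 10 different shuffles
wins = {k: 0 for k in (1, 2, 3)}
for seed in range(10):
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    means = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X_train, y_train, cv=folds).mean()
             for k in wins}
    wins[max(means, key=means.get)] += 1

print(wins)  # if 3-NN wins (almost) every time, drop 1-NN and 2-NN from the search
```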

  • Maybe terms were lost in translation, but I learned that you should use 3 samples (training, validation and test) and that you can use cross-validation (with `model_selection.cross_validate`) on the former two to auto-select the best K hyperparameter (hence having only 2 samples is OK). I'm a bit confused here: should I do cross-validation on the test sample too? With `model_selection.cross_validate` too? – Dan Chaltiel Mar 15 '19 at 22:16
  • Thanks for your edits. So cross-validation is not only used to select hyperparameters (KNN's K) but also to select the model type (KNN vs logistic)? – Dan Chaltiel Mar 16 '19 at 09:41
  • A last real-world question, if I may. If I'm working on a project over several days, the code that splits my population into train and test (`model_selection.train_test_split` in `scikit`) will be called each day. Is this an issue? If yes, is there an optimized method to deal with it, or should I just save the result as `train.csv` and `test.csv`? – Dan Chaltiel Mar 16 '19 at 10:53
  • @DanChaltiel For the sake of comparability of test results among models built over many days, you may do the splitting only once to have a fixed final test set. This way you can say model 1 from day 1 with test accuracy 90% is better than model N from day D with test accuracy 88%, since the test set is the same. – Esmailian Mar 16 '19 at 11:23