I have applied sequential forward selection to my dataset, which has 214 samples and 515 features (a 2-class problem). The feature selection algorithm selected 8 features. I then trained an SVM (MATLAB) on these 8 features. I have also looked at the performance after adding more features. The table below gives the correct classification rate on the training data set for each number of features used:
| Number of features | Correct rate (training) |
| --- | --- |
| 8 | 0.9392 |
| 10 | 0.9439 |
| 12 | 0.9672 |
| 14 | 0.9672 |
| 16 | 0.9626 |
| 18 | 0.9766 |
| 20 | 0.9672 |
As you can see, the accuracy seems to increase as features are added. Is this because of overfitting? Should I use the default feature set returned by MATLAB's `sequentialfs` function, or should I force it to return more features to get higher accuracy?
I have uploaded the training, validation and testing performance (confusion matrices, 70-15-15 split). Can you now tell me whether my model is overfitting or not?
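For reference, a minimal sketch of how I am running the selection and the SVM is shown below. The standardisation, the 10-fold partition and the `'nfeatures'` value of 14 are illustrative assumptions, not my exact settings:

```matlab
% X is the 214 x 515 feature matrix, y the 214 x 1 vector of class labels
% (numeric labels assumed for the ~= comparison).
c    = cvpartition(y, 'KFold', 10);          % partition used by the selection criterion
opts = statset('Display', 'iter');

% Criterion for sequentialfs: number of misclassified hold-out samples
% for an SVM trained on the candidate feature subset.
crit = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= predict(fitcsvm(Xtr, ytr, 'Standardize', true), Xte));

% Default stopping rule - this is what returned 8 features for me.
tf8 = sequentialfs(crit, X, y, 'cv', c, 'options', opts);

% Forcing a larger feature set, e.g. 14 features, via the 'nfeatures' option.
tf14 = sequentialfs(crit, X, y, 'cv', c, 'nfeatures', 14, 'options', opts);

% Final SVM trained on the selected columns.
mdl = fitcsvm(X(:, tf14), y, 'Standardize', true);
```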
- 1,161
- 2
- 11
- 25
-
- Is that the performance on the testing data set? Please also post the performance on the training dataset. (Check out the concept of `Validation Curves` too.) – Dawny33 Aug 08 '16 at 04:42
- @Dawny33 No, this is the training data set; I have not used this model for prediction yet. – girl101 Aug 08 '16 at 04:47
- Then plot a validation curve of your training error and testing error with varying feature-set sizes. If your training error is very low while your testing error is high, then maybe you're overfitting. – Dawny33 Aug 08 '16 at 04:52
- @Dawny33 You mean if my training error is low and my testing error is high, then I am overfitting. Right? – girl101 Aug 08 '16 at 04:53
- Yes. Exactly. Do check out the concept of `Validation Curves` too :) – Dawny33 Aug 08 '16 at 04:56
- Publishing those confusion matrices changes the question quite a bit. I'd say there is no evidence of over-fitting from the graphs. You should bear in mind that almost all models will over- or under-fit slightly, but that isn't usually a concern - getting the best possible generalisation from your model is the real goal. To achieve that for your data you will want to look into k-fold cross-validation (as suggested in my answer). – Neil Slater Aug 08 '16 at 07:31
- @Dawny33 What should the difference between the training and testing performance percentages be? Or what difference in the error rates of the training and testing datasets would let me conclude that my model is not overfitting? – girl101 Aug 10 '16 at 08:31
- @Rishika That depends on the problem statement and the analyst. If there is a considerably large difference, say `>=10%`, then it might be overfitting. – Dawny33 Aug 10 '16 at 08:40
1 Answer
It is not possible to tell whether a machine learning algorithm is overfitting based purely on the training set accuracy.
You could be right that using more features with a small data set increases sampling error and reduces the generalisation of the SVM model you are building. It is a valid concern, but you cannot confirm it with only that suspicion and the training accuracy to look at.
The usual solution to this is to keep some data aside to test your model. When you see a high training accuracy, but a low test accuracy, that is a classic sign of over-fitting.
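As a rough illustration in MATLAB (a sketch only - the hold-out fraction and the SVM settings are assumptions, not taken from your setup):

```matlab
% Hold back ~30% of the data as a test set (assumed fraction).
c   = cvpartition(y, 'HoldOut', 0.3);
Xtr = X(training(c), :);   ytr = y(training(c));
Xte = X(test(c), :);       yte = y(test(c));

mdl = fitcsvm(Xtr, ytr, 'Standardize', true);

trainAcc = mean(predict(mdl, Xtr) == ytr);   % accuracy on data the model was fitted to
testAcc  = mean(predict(mdl, Xte) == yte);   % accuracy on data the model has never seen

% High trainAcc with a much lower testAcc is the classic sign of over-fitting.
```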
Often you are searching for the best hyper-parameters for your model; in your case, you are trying to discover the best number of features to use. Once you start doing that, you will need to run multiple tests in order to pick the best hyper-parameter values. At that point, a single test set becomes a weaker measure of true generalisation (because you have had several attempts and picked the best value - the selection process alone will tend to make you over-estimate the generalisation). So it is common practice to split the data three ways: a training set, a cross-validation set and a test set. The cross-validation set is used to check accuracy as you change the parameters of your model; you pick the best result and then finally use the test set to measure the accuracy of your best model. A common split ratio for this purpose is 60/20/20.
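A sketch of that three-way split, picking the number of features on the cross-validation set, might look like the following. The candidate sizes and the selection criterion are illustrative assumptions, and in practice you would fix the random seed for reproducibility:

```matlab
% 60/20/20 train / cross-validation / test split.
n   = size(X, 1);
idx = randperm(n);
nTr = round(0.6 * n);   nCv = round(0.2 * n);
trIdx = idx(1 : nTr);
cvIdx = idx(nTr+1 : nTr+nCv);
teIdx = idx(nTr+nCv+1 : end);

% Same style of criterion that sequentialfs expects (numeric labels assumed).
crit = @(Xa, ya, Xb, yb) sum(yb ~= predict(fitcsvm(Xa, ya, 'Standardize', true), Xb));

% Build the forward-selection path once on the training split, keeping the history.
[~, hist] = sequentialfs(crit, X(trIdx, :), y(trIdx), 'nfeatures', 20);

bestAcc = -Inf;
for k = [8 10 12 14 16 18 20]                             % candidate feature counts (the hyper-parameter)
    tf  = hist.In(k, :);                                  % k-feature subset from the selection path
    mdl = fitcsvm(X(trIdx, tf), y(trIdx), 'Standardize', true);
    acc = mean(predict(mdl, X(cvIdx, tf)) == y(cvIdx));   % score on the cv set only
    if acc > bestAcc
        bestAcc = acc;   bestTf = tf;   bestMdl = mdl;
    end
end

% The test set is used exactly once, to report the final generalisation estimate.
finalAcc = mean(predict(bestMdl, X(teIdx, bestTf)) == y(teIdx));
```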
Taking a pragmatic approach when using the train/cv/test split, it matters less whether you are over- or under-fitting than that you simply get the best result you can with your data and model class. You can use the feedback on whether you are over-fitting (high training accuracy, low cv accuracy) to change the model parameters - for example, increase regularisation when you are over-fitting.
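For an SVM in MATLAB, the main regularisation knob in `fitcsvm` is `'BoxConstraint'` (the usual C parameter); a smaller value gives a softer margin, i.e. more regularisation. Reusing `Xtr`/`ytr` from the sketch above, and with a purely illustrative value:

```matlab
% Default BoxConstraint is 1; reducing it regularises more strongly (0.1 is just an example).
mdlReg = fitcsvm(Xtr, ytr, 'Standardize', true, 'BoxConstraint', 0.1);
```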
When there is a small number of examples, as in your case, the cv accuracy measure will vary a lot depending on which items end up in the cv set. This makes it hard to pick the best hyper-parameters, because it may just be noise in the data that makes one choice look better than another. To reduce the impact of this, you can use k-fold cross-validation - splitting your train/cv data multiple ways and taking the average of the accuracy (or whatever metric you want to maximise).
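A sketch of the k-fold estimate (10 folds assumed; `selected` stands for whichever feature subset you are evaluating and is not from your code). Strictly, the feature selection itself should also be evaluated on data that was not used to choose the features, otherwise the estimate will be optimistic:

```matlab
mdl   = fitcsvm(X(:, selected), y, 'Standardize', true);  % 'selected' = logical mask of chosen features
cvmdl = crossval(mdl, 'KFold', 10);                       % refits the SVM on 10 train/validation splits
cvAcc = 1 - kfoldLoss(cvmdl);                             % average accuracy across the 10 folds
```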
In your confusion matrices, there is no evidence of over-fitting. A training accuracy of 100%* and a testing accuracy of 93.8% are suggestive of some degree of over-fit, but the sample size is too low to read anything into it. You should bear in mind that the balance between over- and under-fitting is very narrow, and most models will do one or the other to some degree.
* A training accuracy of 100% is nearly always suggestive of overfitting. Matched with, say, 99% test accuracy, you may not be too concerned; the question is at worst "could I do better by increasing regularisation a little?". However, matched with ~60% test accuracy it is clear you have actually overfit - and even then you might have to accept the situation if that is the best you can achieve after trying many different hyper-parameter values (including some attempts with increased regularisation).
- 28,338
- 4
- 77
- 100
- I'm a simple data guy. If I see `cross validation` advice, I give a +1 :D – Dawny33 Aug 08 '16 at 07:23
- @NeilSlater Why three: training set, cross-validation set and test set? I thought cross-validation is a technique you use over the training set. For example, using 10-fold cross-validation, you train/test by grouping the training set into 10 different groups. – Ignacio Alorre Oct 02 '18 at 07:39
- @IgnacioAlorre: https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set - but yes, you can use *k-fold* cross-validation, which is effective but more time-consuming. I already mention that in the answer. – Neil Slater Oct 02 '18 at 07:41
- @NeilSlater Thanks for the quick reply. But there the debate is cross-validation vs. test set. What I mean is: use the training set for a 10-fold cross-validation, and later use the test set to get some metrics on the predictions obtained. So instead of a training set and a cross-validation set, as you pointed out, apply the 10-fold cross-validation over the training set. – Ignacio Alorre Oct 02 '18 at 07:43
- @IgnacioAlorre: Yes, I already reference k-fold cross-validation in the answer. Whether you view that as having "2 data sets" or "repeatedly splitting into train/cv sets" is just semantics - probably driven by whether you are implementing it yourself, or feeding the whole training set into a method which does it for you. – Neil Slater Oct 02 '18 at 07:46
- @NeilSlater Perfect, it is clear now. I didn't realise that the internal cross-validation splits were themselves named the training set and cross-validation set. – Ignacio Alorre Oct 02 '18 at 07:48