12

I have just completed the machine learning with R course on cognitiveclass.ai and have begun experimenting with random forests.

I have built a model using the "randomForest" library in R. The model classifies into two classes: "good" and "bad".

I know that when a model is overfit, it performs well on data from its own training set but badly on out-of-sample data.

To train and test my model, I shuffled the complete dataset and split it into 70% for training and 30% for testing.
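Roughly what I do, as a simplified sketch (`df` and the column name `class` are placeholders for my actual data frame and label):

```r
library(randomForest)

set.seed(1)
# Shuffle the rows, then split 70% / 30%
df <- df[sample(nrow(df)), ]
n_train   <- floor(0.7 * nrow(df))
train_set <- df[1:n_train, ]
test_set  <- df[(n_train + 1):nrow(df), ]

# class is a factor with levels "good" / "bad"
model <- randomForest(class ~ ., data = train_set)
pred  <- predict(model, newdata = test_set)
mean(pred == test_set$class)   # accuracy on the test set
```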

My question: I am getting 100% accuracy on the predictions for the test set. Is this bad? It seems too good to be true.

The objective is waveform recognition for four mutually dependent waveforms. The features of the dataset are the cost results of Dynamic Time Warping (DTW) analysis of the waveforms against their target waveform.
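The DTW cost features are computed along these lines (a sketch using the dtw package; the waveform vectors are placeholders):

```r
library(dtw)

# Cost of aligning a measured waveform against one of the four target
# waveforms; one such cost per target gives the features f1..f4.
alignment <- dtw(measured_waveform, target_waveform)
alignment$distance             # cumulative DTW alignment cost
alignment$normalizedDistance   # the same cost, normalized
```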

Stephen Rauch
Milan van Dijck
  • welcome to the site! Did you try predicting on some noise data? – Toros91 Feb 08 '18 at 09:15
  • Every time you reshuffle, train and test, the accuracy is 100%? – Alex Feb 08 '18 at 09:19
  • @Alex Not exactly, but it stays very high, like 98.55% – Milan van Dijck Feb 08 '18 at 09:24
  • @Toros91 I haven't tried that specifically, but the original "bad" data does contain some very random data. By noise data, do you mean randomly generated data? – Milan van Dijck Feb 08 '18 at 09:26
  • Just take a couple of noise records, try testing, and see how it performs. How many features do you have? If it is not confidential, can you state your objective? – Toros91 Feb 08 '18 at 09:28
  • @Toros91 added the objective as an edit – Milan van Dijck Feb 08 '18 at 09:36
  • as "Jan van der Vegt" has suggested make sure that the features are completely correlated with the target variable, will explain with example for better understanding: if you are trying to predict number of years and taking DOB as a feature. Have you done Predictor Importance test? – Toros91 Feb 08 '18 at 09:39
  • What's the split between classes? – Alex Feb 08 '18 at 09:43
  • @Alex 11.35% "ok" and 88.65% "bad" – Milan van Dijck Feb 08 '18 at 09:47
  • @Toros91 I have been looking at the predictor importance that the random forest reports. For this particular model it is `f1: 11.626811 - f2: 14.647147 - f3: 1.797175 - f4: 6.501746` – Milan van Dijck Feb 08 '18 at 09:51
  • That's quite imbalanced. Try resampling (repeated sampling) to tip the balance in the training set towards the "ok" class (make it 30%, for example) and keep the 11/89 ratio in the test/validation sets, roughly along the lines of the sketch after these comments. What do you get? – Alex Feb 08 '18 at 13:16
  • @Alex I made it about ±30% "ok" and ±70% "bad". The accuracy drops to 97.10145%, so not much of a difference – Milan van Dijck Feb 08 '18 at 15:29
  • 1) Try 50%. 2) Jan may be right: your model is genuinely good. 3) Try another classifier, maybe SVM? – Alex Feb 09 '18 at 09:52
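A minimal sketch of the resampling idea from the comments (assuming a training data frame `train_set` with a factor column `class`; all names here are placeholders):

```r
library(randomForest)

set.seed(42)
ok_rows  <- train_set[train_set$class == "ok", ]
bad_rows <- train_set[train_set$class == "bad", ]

# Oversample the minority "ok" class (with replacement) until it makes up
# roughly 30% of the training data; the test set keeps its original ratio.
n_ok  <- round(0.3 / 0.7 * nrow(bad_rows))
ok_up <- ok_rows[sample(nrow(ok_rows), n_ok, replace = TRUE), ]

train_balanced <- rbind(bad_rows, ok_up)
model <- randomForest(class ~ ., data = train_balanced)
```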

2 Answers

29

High validation scores such as accuracy generally mean that you are not overfitting; however, they should still prompt caution and may indicate that something has gone wrong. They could also mean that the problem is not too difficult and that your model genuinely performs well. Two things that could go wrong:

  • You didn't split the data properly, and some of the validation data also occurs in your training data. In that case the high score does indicate overfitting, because you are no longer measuring generalization (a quick check for this is sketched below).
  • You used feature engineering to create additional features and may have introduced target leakage, where a row uses information from its own target rather than only from other rows in your training set.
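A quick way to sanity-check the first point (a sketch; `train_set` and `test_set` are placeholder names for your two data frames) is to count rows that appear in both sets:

```r
# Inner join on all shared columns; a clean split should return zero rows
overlap <- merge(train_set, test_set)
nrow(overlap)
```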
Jan van der Vegt
1

Investigate which of your features are most predictive. Sometimes you accidentally include your target (or something equivalent to your target) amongst your features.
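With the randomForest package, one way to do this is along these lines (assuming `model` is your fitted forest):

```r
library(randomForest)

importance(model)    # importance measures per feature (e.g. mean decrease in Gini)
varImpPlot(model)    # the same information as a dot plot
```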

tom