12

I have just completed the machine learning with R course on cognitiveclass.ai and have begun experimenting with random forests.

I have built a model using the "randomForest" library in R. The model classifies into two classes: "good" and "bad".

I know that when a model is overfit, it performs well on data from its own training set but badly on out-of-sample data.

To train and test my model, I shuffled the complete dataset and split it into 70% for training and 30% for testing.
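Roughly what I do, as a simplified sketch (`df` and the column name `class` are placeholders for my actual data frame and label):

```r
library(randomForest)

set.seed(1)
# Shuffle the rows, then split 70% / 30%
df <- df[sample(nrow(df)), ]
n_train   <- floor(0.7 * nrow(df))
train_set <- df[1:n_train, ]
test_set  <- df[(n_train + 1):nrow(df), ]

# class is a factor with levels "good" / "bad"
model <- randomForest(class ~ ., data = train_set)
pred  <- predict(model, newdata = test_set)
mean(pred == test_set$class)   # accuracy on the test set
```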

My question: I am getting 100% accuracy on the predictions for the test set. Is this bad? It seems too good to be true.

The objective is waveform recognition for four mutually dependent waveforms. The features of the dataset are the cost results of Dynamic Time Warping (DTW) analysis of the waveforms against their target waveform.
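The DTW cost features are computed along these lines (a sketch using the dtw package; the waveform vectors are placeholders):

```r
library(dtw)

# Cost of aligning a measured waveform against one of the four target
# waveforms; one such cost per target gives the features f1..f4.
alignment <- dtw(measured_waveform, target_waveform)
alignment$distance             # cumulative DTW alignment cost
alignment$normalizedDistance   # the same cost, normalized
```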

Stephen Rauch
Milan van Dijck
  • welcome to the site! Did you try predicting on some noise data? – Toros91 Feb 08 '18 at 09:15
  • Every time you reshuffle, train and test, the accuracy is 100%? – Alex Feb 08 '18 at 09:19
  • @Alex Not exactly, but it stays very high, like 98.55% – Milan van Dijck Feb 08 '18 at 09:24
  • @Toros91 I haven't tried that specifically, but the original "bad" data does contain some very random data. By noise data, do you mean randomly generated data? – Milan van Dijck Feb 08 '18 at 09:26
  • Just take a couple of noise records, try testing, and see how it performs. How many features do you have? If it is not confidential, can you state your objective? – Toros91 Feb 08 '18 at 09:28
  • @Toros91 added the objective as an edit – Milan van Dijck Feb 08 '18 at 09:36
  • as "Jan van der Vegt" has suggested make sure that the features are completely correlated with the target variable, will explain with example for better understanding: if you are trying to predict number of years and taking DOB as a feature. Have you done Predictor Importance test? – Toros91 Feb 08 '18 at 09:39
  • What's the split between classes? – Alex Feb 08 '18 at 09:43
  • @Alex 11.35% "ok" and 88.65% "bad" – Milan van Dijck Feb 08 '18 at 09:47
  • @Toros91 I have been looking at the predictor importance that the random forest reports. For this particular model it is `f1: 11.626811 - f2: 14.647147 - f3: 1.797175 - f4: 6.501746` – Milan van Dijck Feb 08 '18 at 09:51
  • That's quite imbalanced. Try resampling (repeated sampling) to tip the balance in the training set towards the "ok" class (make it 30%, for example) and keep the 11/89 ratio in the test/validation sets, roughly along the lines of the sketch after these comments. What do you get? – Alex Feb 08 '18 at 13:16
  • @Alex I made it about ±30% "ok" and ±70% "bad". The accuracy drops to 97.10145%, so not much of a difference – Milan van Dijck Feb 08 '18 at 15:29
  • 1) Try 50%. 2) Jan may be right: your model is genuinely good. 3) Try another classifier, maybe SVM? – Alex Feb 09 '18 at 09:52
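A minimal sketch of the resampling idea from the comments (assuming a training data frame `train_set` with a factor column `class`; all names here are placeholders):

```r
library(randomForest)

set.seed(42)
ok_rows  <- train_set[train_set$class == "ok", ]
bad_rows <- train_set[train_set$class == "bad", ]

# Oversample the minority "ok" class (with replacement) until it makes up
# roughly 30% of the training data; the test set keeps its original ratio.
n_ok  <- round(0.3 / 0.7 * nrow(bad_rows))
ok_up <- ok_rows[sample(nrow(ok_rows), n_ok, replace = TRUE), ]

train_balanced <- rbind(bad_rows, ok_up)
model <- randomForest(class ~ ., data = train_balanced)
```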

2 Answers

29

High validation scores such as accuracy generally mean that you are not overfitting; however, they should still prompt caution and may indicate that something has gone wrong. They could also mean that the problem is not too difficult and that your model genuinely performs well. Two things that could go wrong:

  • You didn't split the data properly, and some of the validation data also occurs in your training data. In that case the high score does indicate overfitting, because you are no longer measuring generalization (a quick check for this is sketched below).
  • You used feature engineering to create additional features and may have introduced target leakage, where a row uses information from its own target rather than only from other rows in your training set.
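A quick way to sanity-check the first point (a sketch; `train_set` and `test_set` are placeholder names for your two data frames) is to count rows that appear in both sets:

```r
# Inner join on all shared columns; a clean split should return zero rows
overlap <- merge(train_set, test_set)
nrow(overlap)
```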
Jan van der Vegt
1

Investigate which of your features are most predictive. Sometimes you accidentally include your target (or something equivalent to your target) amongst your features.
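With the randomForest package, one way to do this is along these lines (assuming `model` is your fitted forest):

```r
library(randomForest)

importance(model)    # importance measures per feature (e.g. mean decrease in Gini)
varImpPlot(model)    # the same information as a dot plot
```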

tom