
[Confusion matrix of the classifier's predictions on the seven cover-type classes]

Here is the confusion matrix I got while experimenting with the Forest Cover Type Kaggle dataset (link below).

In the matrix, lighter colors and higher numbers represent higher error rates, so, as you can see, a lot of misclassification happens between classes 1 and 0.

I wonder what methods I could use to reduce these two error rates; some improvement has already come from combining two classifiers, Random Forest and Extra Trees. Would stacking help in this case?

Data can be found on https://www.kaggle.com/c/forest-cover-type-prediction/data

1 Answer


Welcome to the site!

I think ensemble methods are tricky: when one of the models doesn't perform well, the accuracy of the ensemble can drop too.

For instance, suppose you are using Random Forest (RF) and rpart for classification, with RF at 90% accuracy and rpart at 60%. If you naively ensemble these two models, the combined accuracy can end up below RF on its own.
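A quick way to check this on your own split (a minimal sketch in Python, assuming scikit-learn; the `Cover_Type`/`Id` column names are taken from the Kaggle `train.csv`, and `DecisionTreeClassifier` is only a rough stand-in for rpart/CART):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder: load the Kaggle training file
df = pd.read_csv("train.csv")
X, y = df.drop(columns=["Cover_Type", "Id"]), df["Cover_Type"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

strong = RandomForestClassifier(n_estimators=300, random_state=0)
weak = DecisionTreeClassifier(max_depth=3, random_state=0)  # deliberately weak

strong.fit(X_tr, y_tr)
print("RF alone:", accuracy_score(y_te, strong.predict(X_te)))

# Soft-voting average of a strong and a weak learner
vote = VotingClassifier([("rf", strong), ("cart", weak)], voting="soft")
vote.fit(X_tr, y_tr)
print("RF + weak CART:", accuracy_score(y_te, vote.predict(X_te)))
```

If the voting ensemble scores below RF alone, the weak model is dragging the average down rather than adding information.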

Coming to your scenario, you need to be careful when stacking: select base models that perform comparably well, and then stack them to improve accuracy.
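For the stacking itself, something along these lines could be a starting point (again only a sketch with scikit-learn, whose `StackingClassifier` is available in recent versions; every hyperparameter here is a placeholder to tune):

```python
import pandas as pd
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder: load the Kaggle training file
df = pd.read_csv("train.csv")
X, y = df.drop(columns=["Cover_Type", "Id"]), df["Cover_Type"]

# Two comparably strong base learners; a simple meta-learner combines
# their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=300, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print("Stacked CV accuracy:", cross_val_score(stack, X, y, cv=3).mean())
```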

How are the 0s and 1s distributed? If they are imbalanced, you need to rebalance them to improve the model's accuracy. To handle imbalanced data, packages such as SMOTE and ROSE are commonly used.
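Your classes sound balanced here, but if you ever do face imbalance, a Python counterpart of those R packages is imbalanced-learn's SMOTE; a minimal sketch on a toy dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Toy imbalanced data just to show the call; replace with your own X, y.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

sm = SMOTE(random_state=0)
X_res, y_res = sm.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority class oversampled to parity
```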

Feature engineering, such as adding external factors or deriving new features, might also help improve your model's accuracy.
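For this particular dataset, the distance columns lend themselves to derived features. The ones below are commonly tried for this competition (a sketch only, with no guarantee they help; column names assumed from the Kaggle `train.csv`):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")

# Straight-line distance to hydrology from its horizontal/vertical components.
df["Euclidean_Distance_To_Hydrology"] = np.hypot(
    df["Horizontal_Distance_To_Hydrology"],
    df["Vertical_Distance_To_Hydrology"],
)

# Elevation adjusted by the vertical drop to the nearest water.
df["Elevation_Minus_VDist_Hydrology"] = (
    df["Elevation"] - df["Vertical_Distance_To_Hydrology"]
)

# Interactions between the horizontal distance features.
df["Dist_Hydrology_Plus_Fire"] = (
    df["Horizontal_Distance_To_Hydrology"]
    + df["Horizontal_Distance_To_Fire_Points"]
)
df["Dist_Hydrology_Minus_Road"] = (
    df["Horizontal_Distance_To_Hydrology"]
    - df["Horizontal_Distance_To_Roadways"]
)
```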

Do let me know if you have any additional questions.

Toros91
  • Thank you for your answer! All classes in this dataset actually have the same size, so 0 and 1 are not minorities in this case. I feel that since 0 and 1 look pretty much the same across almost all features, it is hard to classify them correctly. Do you know any way to handle this scenario? – Chenxiong Yi Dec 13 '17 at 07:27
  • So the data is normally distributed, what all features do you have? – Toros91 Dec 13 '17 at 07:28
  • https://www.kaggle.com/c/forest-cover-type-prediction/data – you can see all the features here. By the way, since all classes have the same size, shouldn't the distribution be uniform? – Chenxiong Yi Dec 13 '17 at 07:30
  • Can you explain the above statement with an example? – Toros91 Dec 13 '17 at 07:35
  • I just mean no class has more training data than other classes. Sorry for the confusion. – Chenxiong Yi Dec 13 '17 at 07:37
  • oh I see, then they are normally distributed. But looking at the data on the kaggle site I feel that you need to do work on adding new features(any external factors which you can think of), did you normalize the distance features? – Toros91 Dec 13 '17 at 07:44
  • Yes, I did. I normalized every feature before feeding the data to the algorithms, and I even added products of different distance features, but they still don't significantly reduce these two error rates. I feel that I probably need to look for external data that may provide more information about these two classes. – Chenxiong Yi Dec 13 '17 at 07:48
  • Yes, exactly, that might help you get a better outcome. You used labels 0-6, right? But in the outcome they are treated as 1-7. – Toros91 Dec 13 '17 at 07:52
  • Yeah, they are 1-7. I used 0-6 to be consistent with the confusion matrix picture. – Chenxiong Yi Dec 13 '17 at 07:55
  • cool! try getting some external factors and update me if you need any help. Edited your question, appended the dataset link. – Toros91 Dec 13 '17 at 07:56
  • any update? If the answer was useful +1 is appreciated. – Toros91 Jan 31 '18 at 01:39