
I am taking a course that introduced me to sklearn.ensemble.RandomForestClassifier. At first it uses n_estimators with the default value of 10, and the resulting accuracy turns out to be around 0.28. If I change n_estimators to 15, the accuracy goes up to 0.32.

Here's some of the code:

# Partial snippet from the course; get_numeric_data and get_text_data are
# selector transformers defined elsewhere in the course code.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer  # deprecated; newer scikit-learn uses sklearn.impute.SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # numeric columns: select, then fill missing values
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            # text columns: select, then convert to a bag-of-words matrix
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', RandomForestClassifier())
])

I thought that increasing the number of trees (n_estimators) in the RandomForestClassifier would give better accuracy, but even with a value of 100 I sometimes only get between 0.30 and 0.32. Could someone please explain why? How do you find the smallest value of n_estimators that gives the highest possible accuracy?
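One way to check this empirically is a validation curve over the whole pipeline. Below is a minimal sketch using sklearn's validation_curve; X and y are placeholders for the course's feature data and labels, and the parameter range is only illustrative.

from sklearn.model_selection import validation_curve

# X and y are hypothetical stand-ins for the course's feature DataFrame and labels.
param_range = [10, 25, 50, 100, 200]
train_scores, val_scores = validation_curve(
    pl, X, y,
    param_name='clf__n_estimators',  # targets the RandomForestClassifier step
    param_range=param_range,
    cv=3,
    scoring='accuracy',
)

# Mean cross-validated accuracy for each value of n_estimators
for n, score in zip(param_range, val_scores.mean(axis=1)):
    print(n, round(score, 3))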

    There is no `n_elements` argument in sklearn's [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html); if you mean `n_estimators`, this has a default value of 100, and not 10. Please **clarify**, as your shown code is actually irrelevant to the question. – desertnaut Oct 19 '20 at 23:44
  • I just noticed I typed n_elements instead of n_estimators, sorry about that. I am taking a course in DataCamp called _Case Study: School Budgeting with Machine Learning in Python_ that specifies it has 10 as default (even though in the documentation 100 is specified for the default) – Carmen Oct 19 '20 at 23:50
  • As can be seen in the documentation, the default was changed in version 0.22 from 10 to 100. – Ben Reiniger Oct 20 '20 at 14:06
  • The only consistent effect of `n_estimators` is that more trees reduce variance in the predictions (and take more time to train). Any other apparent effect on performance is only due to random effects. https://datascience.stackexchange.com/q/1028/55122 – Ben Reiniger Oct 20 '20 at 14:10
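To see how much of such a difference is just noise, one could repeat the evaluation with different random seeds and compare the spread. A minimal sketch, assuming the pipeline pl from the question and hypothetical X and y:

import numpy as np
from sklearn.model_selection import cross_val_score

# X and y are hypothetical stand-ins for the course's features and labels.
for n in (10, 100):
    scores = []
    for seed in range(5):
        pl.set_params(clf__n_estimators=n, clf__random_state=seed)
        scores.append(cross_val_score(pl, X, y, cv=3, scoring='accuracy').mean())
    # If the gap between n=10 and n=100 is within this spread, it is likely noise.
    print(n, np.mean(scores), np.std(scores))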

1 Answer


If you are talking about test accuracy here (i.e., you are evaluating on data the model was not trained on), it's possible that adding more estimators is overfitting your training set, which then hurts performance on your holdout set. If that is the case, I would recommend trying a simpler model such as LogisticRegression, since it is less prone to overfitting than ensemble methods.
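One way to check for this is to compare training accuracy with holdout accuracy as n_estimators grows; a large gap suggests overfitting. A minimal sketch, assuming the pipeline pl from the question and hypothetical X_train/X_test/y_train/y_test splits; the last lines show that swapping in LogisticRegression is a one-line change on the same pipeline.

from sklearn.linear_model import LogisticRegression

# Hypothetical train/test splits of the course data.
for n in (10, 50, 100):
    pl.set_params(clf__n_estimators=n)
    pl.fit(X_train, y_train)
    # A training score far above the test score points to overfitting.
    print(n, pl.score(X_train, y_train), pl.score(X_test, y_test))

# Replacing the final step with a simpler classifier on the same pipeline:
pl.set_params(clf=LogisticRegression(max_iter=1000))
pl.fit(X_train, y_train)
print('logreg', pl.score(X_test, y_test))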

As for finding the best parameters, try sklearn's RandomizedSearchCV to fine-tune your hyperparameters and maximize performance.
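For example, here is a minimal sketch of a randomized search over the forest inside the pipeline from the question; the parameter ranges are only illustrative and X_train/y_train are placeholder names:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'clf__n_estimators': randint(10, 200),
    'clf__max_depth': [None, 10, 20, 40],
    'clf__min_samples_leaf': randint(1, 10),
}

search = RandomizedSearchCV(
    pl,                      # the full pipeline, so preprocessing is refit in each fold
    param_distributions,
    n_iter=20,
    cv=3,
    scoring='accuracy',
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)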

Oliver Foster