
I am using Scikit-Learn for this classification problem. The dataset has 3 features and 600 data points with labels.

First I used a Nearest Neighbors classifier. Instead of using cross-validation, I manually ran the fit 5 times, re-splitting the dataset (80-20) into training and test sets each time. The average score turned out to be 0.61.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=4)
score = 0
for i in range(5):
    # Re-split the data into a random 80-20 train/test split each iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf.fit(X_train, y_train)
    score += clf.score(X_test, y_test)
print(score / 5.0)

However, when I ran cross-validation, the average score was merely 0.45.

from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier(n_neighbors=4)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
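For reference, printing the individual fold scores instead of only their mean shows whether the low average comes from all folds uniformly or just from one or two weak folds; a minimal sketch:

# Inspect each fold's score, not just the mean
scores = cross_val_score(clf, X, y, cv=5)
for i, s in enumerate(scores):
    print("fold", i, ":", round(s, 3))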

Why does cross-validation produce a significantly lower score than manual resampling?

I also tried a Random Forest classifier, this time using grid search to tune the hyperparameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'bootstrap': [True],
    'max_depth': [8, 10, 12],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [62, 64, 66, 68, 70]
}
clf = RandomForestClassifier()
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X, y)
grid_search.best_params_, grid_search.best_score_

The best score turned out to be 0.508, with the following parameters:

({'bootstrap': True,
  'max_depth': 10,
  'min_samples_leaf': 4,
  'min_samples_split': 10,
  'n_estimators': 64},
 0.5081967213114754)

I then went ahead and made predictions on all 600 data points, and the accuracy is quite high: 0.7688.

from sklearn.metrics import accuracy_score

best_grid = grid_search.best_estimator_
y_pred = best_grid.predict(X)
accuracy_score(y, y_pred)

I know `.best_score_` is the "Mean cross-validated score of the best_estimator". But I don't understand why it is so much lower than the prediction accuracy on the whole set.

ddd

2 Answers


For your random forest, this is because your final model is overfitting. Scikit-learn's GridSearchCV has a default argument `refit = True` that takes the model with the best cross-validated performance and retrains it on the whole dataset. Your accuracy score is very high because it is measured only on your training data, whereas `best_score_` reflects how your model performs on data it has not seen.

To wrap up: your random forest is overfitting badly, as there is a big gap between your validation and training error. Try `refit = False` and you will no longer see this gap (but you still have a problem, since this model still overfits your training set).
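A minimal sketch of a more honest evaluation, reusing `param_grid`, `X`, and `y` from the question (the 80-20 hold-out and `random_state` are illustrative assumptions, not from the original post): keep a test set aside before the grid search, then score the refit `best_estimator_` only on that held-out data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out data that the grid search never sees (assumed split for illustration)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

grid_search = GridSearchCV(RandomForestClassifier(), param_grid,
                           cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# best_estimator_ was refit on all of X_train, so scoring it on X_train
# would repeat the inflated-accuracy mistake; score on X_test instead
print("CV estimate:      ", grid_search.best_score_)
print("held-out accuracy:", accuracy_score(
    y_test, grid_search.best_estimator_.predict(X_test)))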

David Masip
  • So part of the reason I am overfitting is the refit? And that's why I should set `refit = False` to avoid overfitting? – ddd Apr 24 '18 at 15:59
  • If you set refit to False, you will no longer have such a big gap between your best score and accuracy score. However, this doesn't mean that your validation accuracy will be significantly higher. To get a better predictor, I would work on reducing variance, e.g. by using a simpler model or regularizing your model more. – David Masip Apr 24 '18 at 17:36

I know this question has been here for two years; however, I was having the same problem when using cross_val_score on my data, and I ended up here.

The results returned by cross_val_score were very different from what I got when doing cross-validation manually with train_test_split, as you were doing with the Nearest Neighbors classifier. Apparently, cross_val_score splits the data in its original order, without shuffling, so if the rows are ordered (e.g. sorted by label) the folds are not representative. When I shuffled my data using sklearn.utils.shuffle, I got results much more consistent with the manual cross-validation, as sketched below.
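A minimal sketch of that fix, assuming the `X`, `y`, and classifier from the question (passing `cv=KFold(n_splits=5, shuffle=True)` to cross_val_score would have a similar effect):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle

# Shuffle rows and labels together so each fold gets a representative
# mix of classes even if the original data were sorted
X_s, y_s = shuffle(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=4)
scores = cross_val_score(clf, X_s, y_s, cv=5)
print(scores.mean())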

I am new to scikit-learn, so please forgive me if there is something wrong above.

yzz