
I am using Scikit-Learn for this classification problem. The dataset has 3 features and 600 data points with labels.

First I used a Nearest Neighbors classifier. Instead of using cross-validation, I manually ran the fit 5 times, re-splitting the dataset (80-20) into training and test sets each time. The average score turned out to be 0.61.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=4)
score = 0
for i in range(5):
    # Re-split the data into a random 80-20 train/test split each iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf.fit(X_train, y_train)
    score += clf.score(X_test, y_test)
print(score / 5.0)

However, when I ran cross-validation, the average score was merely 0.45.

from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier(n_neighbors=4)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
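For reference, printing the individual fold scores instead of only their mean shows whether the low average comes from all folds uniformly or just from one or two weak folds; a minimal sketch:

# Inspect each fold's score, not just the mean
scores = cross_val_score(clf, X, y, cv=5)
for i, s in enumerate(scores):
    print("fold", i, ":", round(s, 3))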

Why does cross-validation produce a significantly lower score than manual resampling?

I also tried a Random Forest classifier, this time using grid search to tune the hyperparameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'bootstrap': [True],
    'max_depth': [8, 10, 12],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [62, 64, 66, 68, 70]
}
clf = RandomForestClassifier()
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X, y)
grid_search.best_params_, grid_search.best_score_

The best score turned out to be 0.508, with the following parameters:

({'bootstrap': True,
  'max_depth': 10,
  'min_samples_leaf': 4,
  'min_samples_split': 10,
  'n_estimators': 64},
 0.5081967213114754)

I then went ahead and made predictions on all 600 data points, and the accuracy is quite high: 0.7688.

from sklearn.metrics import accuracy_score

best_grid = grid_search.best_estimator_
y_pred = best_grid.predict(X)
accuracy_score(y, y_pred)

I know `.best_score_` is the "Mean cross-validated score of the best_estimator". But I don't understand why it is so much lower than the prediction accuracy on the whole set.

ddd

2 Answers


For your random forest, this is because your final model is overfitting. Scikit-learn's GridSearchCV has a default argument `refit = True` that takes the model with the best cross-validated performance and retrains it on the whole dataset. Your accuracy score is very high because it is measured only on your training data, whereas `best_score_` reflects how your model performs on data it has not seen.

To wrap up: your random forest is overfitting badly, as there is a big gap between your validation and training error. Try `refit = False` and you will no longer see this gap (but you still have a problem, since this model still overfits your training set).
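A minimal sketch of a more honest evaluation, reusing `param_grid`, `X`, and `y` from the question (the 80-20 hold-out and `random_state` are illustrative assumptions, not from the original post): keep a test set aside before the grid search, then score the refit `best_estimator_` only on that held-out data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out data that the grid search never sees (assumed split for illustration)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

grid_search = GridSearchCV(RandomForestClassifier(), param_grid,
                           cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# best_estimator_ was refit on all of X_train, so scoring it on X_train
# would repeat the inflated-accuracy mistake; score on X_test instead
print("CV estimate:      ", grid_search.best_score_)
print("held-out accuracy:", accuracy_score(
    y_test, grid_search.best_estimator_.predict(X_test)))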

David Masip
  • So part of the reason I am overfitting is the refit? And that's why I should set `refit = False` to avoid overfitting? – ddd Apr 24 '18 at 15:59
  • If you set refit to False, you will no longer have such a big gap between your best score and accuracy score. However, this doesn't mean that your validation accuracy will be significantly higher. To get a better predictor, I would work on reducing variance, e.g. by using a simpler model or regularizing your model more. – David Masip Apr 24 '18 at 17:36

I know this question has been here for two years; however, I was having the same problem when using cross_val_score on my data, and I ended up here.

The results returned by cross_val_score were very different from what I got when doing cross-validation manually with train_test_split, as you were doing with the Nearest Neighbors classifier. Apparently, cross_val_score splits the data in its original order, without shuffling, so if the rows are ordered (e.g. sorted by label) the folds are not representative. When I shuffled my data using sklearn.utils.shuffle, I got results much more consistent with the manual cross-validation, as sketched below.
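A minimal sketch of that fix, assuming the `X`, `y`, and classifier from the question (passing `cv=KFold(n_splits=5, shuffle=True)` to cross_val_score would have a similar effect):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle

# Shuffle rows and labels together so each fold gets a representative
# mix of classes even if the original data were sorted
X_s, y_s = shuffle(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=4)
scores = cross_val_score(clf, X_s, y_s, cv=5)
print(scores.mean())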

I am new to scikit-learn, so please forgive me if there is something wrong above.

yzz