Problem:
- Fitting an SVM inside GridSearchCV (4 values of C, 5-fold CV) takes about 29 seconds, while fitting a single SVM with one value of C and no CV takes about 517 seconds.
- The AUC on the test set is lower when the SVM is run outside of GridSearchCV (0.71 versus 0.83).
Background: I am trying to run an SVM classifier. My dataset has 1732 features and about 7000 data points. To reduce the dimensionality, I run PCA, which retains 95% of the variance with 261 components.
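For context on how the 0.95 setting picks the number of components: as far as I understand, passing a float between 0 and 1 makes PCA keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction. A minimal sketch on synthetic data (the array X and its shape are made up for illustration, this is not my dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a standardized training set (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30)) @ rng.normal(size=(30, 30))

# A float in (0, 1) asks PCA to keep enough components to pass that variance.
pca = PCA(n_components=0.95, svd_solver="full")
pca.fit(X)

# Recover the same count by hand from the full variance spectrum.
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_by_hand = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

print("Components chosen by PCA:", pca.n_components_)
print("Components counted by hand:", n_by_hand)
```

The two counts should agree, which is a quick way to sanity-check the number of components reported further down.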
Code for PCA:
import time

from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
scaler2 = StandardScaler()
train_set_standardized = scaler2.fit_transform(train_set)
test_set_standardized = scaler2.transform(test_set)
pca_variance = 0.95
pca = PCA(pca_variance)
train_set_pca = pca.fit_transform(train_set_standardized)
test_set_pca = pca.transform(test_set_standardized)
print("Number of principal components: {} \t PCA Variance: {}".format(pca.n_components_, pca_variance))
Code for GridSearchCV:
start_time = time.time()
param_grid = [
    # {'C': [0.1, 1, 5, 7], 'kernel': ['linear']}
    {'C': [0.1, 1, 5, 10], 'max_iter': [10000]}
]
svc = svm.SVC()
clf = GridSearchCV(svc, param_grid, cv=5)
clf.fit(train_set_pca, train_lbl)
end_time = time.time()
print("Total time on GRID Search CV:", end_time - start_time)
print("CV Results:")
print(clf.cv_results_)
preds = clf.best_estimator_.predict(test_set_pca)
print("AUC:",roc_auc_score(preds, test_lbl))
Output:
Number of principal components: 261 PCA Variance: 0.95
Total time on GRID Search CV: 28.879327297210693
CV Results:
{'mean_fit_time': array([0.92655878, 0.90988474, 0.95111265, 0.96867824]),
 'std_fit_time': array([0.0274404 , 0.00936632, 0.01084986, 0.01841955]),
 'mean_score_time': array([0.41218376, 0.41684995, 0.42106729, 0.41611543]),
 'std_score_time': array([0.00447276, 0.00562701, 0.00456414, 0.00288679]),
 'param_C': masked_array(data=[0.1, 1, 5, 10],
                         mask=[False, False, False, False],
                         fill_value='?',
                         dtype=object),
 'param_max_iter': masked_array(data=[10000, 10000, 10000, 10000],
                                mask=[False, False, False, False],
                                fill_value='?',
                                dtype=object),
 'params': [{'C': 0.1, 'max_iter': 10000},
            {'C': 1, 'max_iter': 10000},
            {'C': 5, 'max_iter': 10000},
            {'C': 10, 'max_iter': 10000}],
 'split0_test_score': array([0.89938217, 0.90026478, 0.92586055, 0.92674316]),
 'split1_test_score': array([0.90017668, 0.90017668, 0.91872792, 0.91872792]),
 'split2_test_score': array([0.90017668, 0.90459364, 0.91784452, 0.92226148]),
 'split3_test_score': array([0.90017668, 0.90194346, 0.91784452, 0.91784452]),
 'split4_test_score': array([0.90017668, 0.90371025, 0.91696113, 0.91696113]),
 'mean_test_score': array([0.90001778, 0.90213776, 0.91944773, 0.92050764]),
 'std_test_score': array([0.0003178 , 0.00178301, 0.00325472, 0.00359986]),
 'rank_test_score': array([4, 3, 2, 1], dtype=int32)}
AUC: 0.8330464716006885
The best estimator above was at C=10.
So I expected that running the SVM directly in the following manner would give the same results.
Code:
start_time = time.time()
svc = svm.SVC(C=10, kernel = 'linear')
svc.fit(train_set_pca, train_lbl)
end_time = time.time()
print("Total time on SVM:", end_time - start_time)
preds = svc.predict(test_set_pca)
print("AUC:",roc_auc_score(preds, test_lbl))
Output:
Total time on SVM: 516.9719932079315
AUC: 0.7128378378378378
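One difference between the two runs that may be relevant: the uncommented grid entry does not set a kernel, so GridSearchCV fits SVC with its default RBF kernel and max_iter=10000, while the standalone run uses kernel='linear' with no iteration cap. Below is a minimal sketch of a like-for-like comparison, refitting a standalone SVC with exactly clf.best_params_. It runs on synthetic data from make_classification rather than my dataset, and it assumes binary labels. It also passes decision_function scores to roc_auc_score in (y_true, y_score) order, which is the documented argument order (my code above passes hard predictions first).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: binary labels, not the real dataset).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same grid as in the question: no kernel given, so the default RBF is used.
param_grid = [{'C': [0.1, 1, 5, 10], 'max_iter': [10000]}]
clf = GridSearchCV(SVC(), param_grid, cv=5)
clf.fit(X_train, y_train)

# Refit a standalone SVC with exactly the parameters the search selected.
standalone = SVC(**clf.best_params_)
standalone.fit(X_train, y_train)

# roc_auc_score expects (y_true, y_score); continuous decision_function
# scores give a ranking-based AUC rather than one computed from hard labels.
auc_grid = roc_auc_score(y_test, clf.best_estimator_.decision_function(X_test))
auc_standalone = roc_auc_score(y_test, standalone.decision_function(X_test))
print("AUC (grid best estimator):", auc_grid)
print("AUC (standalone refit):", auc_standalone)
```

With identical parameters the two AUC values should match, so any remaining gap in the real runs would point to the kernel or max_iter settings rather than to GridSearchCV itself.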