
Problem:

  1. Running SVM inside GridSearchCV (4 values of C, 5-fold CV) is faster than running it on its own with a single value of C and no CV.
  2. The AUC on the test set is lower when SVM is run outside of GridSearchCV.

Background: I am trying to run an SVM classifier. My dataset has 1732 features and about 7000 data points. To reduce the dimensionality, I run PCA, which retains 95% of the variance with 261 components.
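For readers unfamiliar with passing a fraction to PCA, here is a minimal sketch of how the component count is chosen. It uses small synthetic data (an assumption, not the dataset from this question), so the numbers it prints will differ from the 261 components reported here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 200 samples,
# 50 features driven by 10 latent factors plus a little noise.
rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 10))
X = latent @ rng.normal(size=(10, 50)) + 0.1 * rng.normal(size=(200, 50))

X_std = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_std)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("variance retained:", cumulative[-1])
```

The fitted `n_components_` attribute is what the question's print statement reports.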

Code for PCA and grid search:

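import time

from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC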

scaler2 = StandardScaler()
train_set_standardized = scaler2.fit_transform(train_set)
test_set_standardized = scaler2.transform(test_set)

pca_variance = 0.95
pca = PCA(pca_variance)
train_set_pca = pca.fit_transform(train_set_standardized)
test_set_pca = pca.transform(test_set_standardized)

print("Number of principal components: {} \t PCA Variance: {}".format(pca.n_components_, pca_variance))

start_time = time.time()

param_grid = [
    # {'C': [0.1, 1, 5, 7], 'kernel': ['linear']}
    {'C': [0.1, 1, 5, 10], 'max_iter': [10000]}
]
svc = svm.SVC()
clf = GridSearchCV(svc, param_grid, cv=5)
clf.fit(train_set_pca, train_lbl)
end_time = time.time()
print("Total time on GRID Search CV:", end_time - start_time)
print("CV Results:")
print(clf.cv_results_)


preds = clf.best_estimator_.predict(test_set_pca)
print("AUC:",roc_auc_score(preds, test_lbl))

Output:

Number of principal components: 261      PCA Variance: 0.95
Total time on GRID Search CV: 28.879327297210693
CV Results:
{'mean_fit_time': array([0.92655878, 0.90988474, 0.95111265, 0.96867824]),
 'std_fit_time': array([0.0274404 , 0.00936632, 0.01084986, 0.01841955]),
 'mean_score_time': array([0.41218376, 0.41684995, 0.42106729, 0.41611543]),
 'std_score_time': array([0.00447276, 0.00562701, 0.00456414, 0.00288679]),
 'param_C': masked_array(data=[0.1, 1, 5, 10],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_max_iter': masked_array(data=[10000, 10000, 10000, 10000],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.1, 'max_iter': 10000},
  {'C': 1, 'max_iter': 10000},
  {'C': 5, 'max_iter': 10000},
  {'C': 10, 'max_iter': 10000}],
 'split0_test_score': array([0.89938217, 0.90026478, 0.92586055, 0.92674316]),
 'split1_test_score': array([0.90017668, 0.90017668, 0.91872792, 0.91872792]),
 'split2_test_score': array([0.90017668, 0.90459364, 0.91784452, 0.92226148]),
 'split3_test_score': array([0.90017668, 0.90194346, 0.91784452, 0.91784452]),
 'split4_test_score': array([0.90017668, 0.90371025, 0.91696113, 0.91696113]),
 'mean_test_score': array([0.90001778, 0.90213776, 0.91944773, 0.92050764]),
 'std_test_score': array([0.0003178 , 0.00178301, 0.00325472, 0.00359986]),
 'rank_test_score': array([4, 3, 2, 1], dtype=int32)}

AUC:  0.8330464716006885

Best estimator above was at C=10.
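One thing that may be worth checking at this point is which settings the winning estimator actually carries besides C. The sketch below uses synthetic data from `make_classification` (an assumption, not the data from this question) with a grid of the same shape as above, and prints the kernel, `max_iter`, and score of the best candidate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the dataset from the question.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Same shape of grid as in the question: only C and max_iter are listed,
# so every candidate keeps SVC's default kernel.
param_grid = [{"C": [0.1, 1, 5, 10], "max_iter": [10000]}]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

best = search.best_estimator_
print("best params:", search.best_params_)
print("kernel used:", best.kernel)       # SVC's default kernel is 'rbf'
print("max_iter used:", best.max_iter)

# With scoring left unset, GridSearchCV uses the estimator's own score
# method, which for classifiers is accuracy rather than AUC.
print("mean CV accuracy of best candidate:", search.best_score_)
```

If the same holds for the run in this question, the grid search compared RBF-kernel models capped at 10000 iterations, and the `mean_test_score` values are accuracies.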

So I expected that running SVM directly with C=10, as follows, would give the same results.

Code:

start_time = time.time()
svc = svm.SVC(C=10, kernel='linear')
svc.fit(train_set_pca, train_lbl)
end_time = time.time()
print("Total time on SVM:", end_time - start_time)
preds = svc.predict(test_set_pca)
print("AUC:",roc_auc_score(preds, test_lbl))

Output:

Total time on SVM: 516.9719932079315
AUC: 0.7128378378378378
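For a like-for-like comparison, a possible approach is to rebuild the standalone model from the full parameter set of the grid search winner, so that the kernel and `max_iter` match as well as C. The sketch below does this on synthetic data (an assumption, not the data from this question). It also calls `roc_auc_score` with the true labels first and continuous `decision_function` scores second, which is the argument order scikit-learn documents, instead of passing hard predictions first.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the dataset from the question.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "max_iter": [10000]}, cv=5)
search.fit(X_train, y_train)

# Copy every parameter of the winner (kernel, C, max_iter, ...) so the
# standalone model is configured identically.
standalone = SVC(**search.best_estimator_.get_params())
standalone.fit(X_train, y_train)

# roc_auc_score expects (y_true, y_score).
auc_search = roc_auc_score(y_test, search.decision_function(X_test))
auc_standalone = roc_auc_score(y_test, standalone.decision_function(X_test))
print("AUC via GridSearchCV:", auc_search)
print("AUC via standalone SVC:", auc_standalone)
```

With identical parameters and training data, the two AUC values should agree. A remaining gap in timing or AUC would then point to something other than the configuration.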
