Hello StackExchange community,
I am trying to apply nested cross-validation (NCV) to a pipeline to get a reliable estimate of my model's generalization error. The pipeline includes two steps:
- Scaling the data
- Building a logistic regression model
The issue I am experiencing is as follows: when I run the code below with scaling as part of the pipeline, the first outer-loop iteration finishes quickly, printing its results within 1-2 minutes at most. However, something gets stuck on the second outer-loop iteration. I have tried this several times; the loop runs for hours without ever printing results, and I cannot work out why.
As a check on the code and pipeline, when I run the same procedure without scaling, the entire nested cross-validation completes within minutes on my computer.
I have pasted my code below, but please note I have replaced the reference to my actual dataset with a synthetic dataset, as the actual dataset is proprietary. I do not experience any runtime issues on the synthetic dataset, but I suspect that is because make_classification already produces roughly standardized features.
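A quick sanity check of that suspicion (just a sketch; it assumes X_combined from the make_classification call in the code below):

#Sketch: check whether the synthetic features are already roughly standardized
import numpy as np
print(np.round(X_combined.mean(axis=0), 2))  #near 0 if already centered
print(np.round(X_combined.std(axis=0), 2))   #near 1 if already scaled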
Any advice would be much appreciated. I know NCV is computationally expensive, but I am confused about why:
- it runs relatively quickly without feature scaling, and
- with scaling, the first outer-loop iteration finishes relatively quickly, while the second (and presumably subsequent ones) runs seemingly forever (see the timing sketch after the code below).
Thank you!
#Importing libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
#Synthetic stand-in for the proprietary dataset
X_combined, Y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
#Setting up the outer cross validation procedure
cv_outer = KFold(n_splits=2, shuffle=True, random_state=1)
outer_results_f1 = list()
for train_ix, test_ix in cv_outer.split(X_combined):
    #Split data
    X_train, X_test = X_combined[train_ix, :], X_combined[test_ix, :]
    y_train, y_test = Y[train_ix], Y[test_ix]
    #Setting up the inner cross validation procedure
    cv_inner = RepeatedStratifiedKFold(n_splits=2, n_repeats=1, random_state=1)
    #Initializing the model
    model = LogisticRegression(solver='liblinear', random_state=1, max_iter=3000)
    #Building a pipeline that includes scaling the data
    #NOTE - THIS IS WHERE I THINK THE PROBLEM IS OCCURRING
    #The entire procedure runs quickly if I take out the "scaling" part
    lr_pipe = Pipeline([('scaler', StandardScaler()), ('LR', model)])
    param_set = [{'LR__penalty': ['l1', 'l2'],
                  'LR__C': [100000, 10000, 1000, 100, 10, 1]}]
    #Define and conduct the grid search
    search = GridSearchCV(lr_pipe, param_set, scoring='f1', cv=cv_inner, refit=True)
    #Fit the grid search on the training set; refit=True retrains the best model on the whole training set
    result = search.fit(X_train, y_train)
    best_model = result.best_estimator_
    #Evaluate the best model on the hold-out test set
    yhat = best_model.predict(X_test)
    #Calculating the F1 score
    f1 = f1_score(y_test, yhat)
    outer_results_f1.append(f1)
    print('>F1=%.3f, est=%.3f, cfg=%s' % (f1, result.best_score_, result.best_params_))
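In case it helps with diagnosing this, here is a sketch of how I could investigate which fit the second fold stalls on. It reuses lr_pipe, param_set, cv_inner, X_train and y_train from the loop above; verbose is a standard GridSearchCV parameter that prints per-candidate fit times, and the rest just times the outer fold:

#Sketch: make the inner grid search verbose and time each outer fold,
#to see which hyper-parameter candidate stalls
import time
search = GridSearchCV(lr_pipe, param_set, scoring='f1', cv=cv_inner, refit=True, verbose=3)
start = time.perf_counter()
result = search.fit(X_train, y_train)
print('outer fold took %.1f seconds' % (time.perf_counter() - start))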