
Hello StackExchange community,

I am trying to apply Nested Cross Validation on a pipeline to get a reliable estimate of the generalization error of my model. The pipeline includes two steps:

  1. Scaling the data
  2. Building a logistic regression model.

The issue is as follows. When I run the code below with scaling as part of the pipeline, the first outer-loop iteration finishes quickly and prints its results within 1-2 minutes. The second outer-loop iteration, however, appears to hang: I have tried this several times, and it runs for hours without printing any results. I do not understand why.

By comparison, when I remove the scaling step from the pipeline, the entire nested cross-validation procedure finishes within minutes on my computer.

I have pasted my code below. Please note that I have replaced the reference to my actual dataset with a synthetic dataset, because the actual dataset is proprietary. I do not experience any runtime issues on the synthetic dataset, but I suspect that this is because make_classification already produces scaled data.
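One way to test that suspicion is to look at the per-feature spread of the synthetic data, and then deliberately put the columns on very different scales so the synthetic set looks more like unscaled real-world data. The following is only a sketch: the rescaling factors (`factors`) and the name `X_unscaled` are assumptions made for illustration and are not derived from the proprietary dataset.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same synthetic data as in the question
X, y = make_classification(n_samples=1000, n_features=20, random_state=1,
                           n_informative=10, n_redundant=10)

# Per-feature standard deviations: a narrow range means the columns
# already live on similar scales
stds = X.std(axis=0)
print("synthetic std range: %.3f to %.3f" % (stds.min(), stds.max()))

# Hypothetical rescaling (an assumption, not taken from the real data):
# multiply each column by a factor between 1e-3 and 1e3
rng = np.random.default_rng(0)
factors = 10.0 ** rng.uniform(-3, 3, size=X.shape[1])
X_unscaled = X * factors

stds_unscaled = X_unscaled.std(axis=0)
print("rescaled std range: %.3g to %.3g" % (stds_unscaled.min(), stds_unscaled.max()))
```

Substituting `X_unscaled` for `X_combined` in the code below may help show whether the slowdown can be reproduced without the proprietary data.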

Any advice would be much appreciated. I know that nested cross-validation (NCV) is computationally expensive, but I am confused about why:

  • It runs relatively quickly without feature scaling, and
  • With scaling, it runs the first outer-loop iteration relatively quickly, but the second (and presumably subsequent ones) seemingly never finishes.
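To narrow down where the time goes, the mean fit time of every grid configuration can be read from the `cv_results_` attribute of a fitted `GridSearchCV`. The sketch below uses the synthetic data and the same grid as the code in this question; it assumes the slowdown is tied to particular `LR__C` / `LR__penalty` combinations, which is only a guess. Setting `verbose=2` on `GridSearchCV` prints each fit as it completes, which may also help if a single fit never returns.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=1,
                           n_informative=10, n_redundant=10)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('LR', LogisticRegression(solver='liblinear', random_state=1,
                                           max_iter=3000))])
param_grid = {'LR__penalty': ['l1', 'l2'],
              'LR__C': [100000, 10000, 1000, 100, 10, 1]}

# refit=False: only the per-configuration timings are of interest here
search = GridSearchCV(pipe, param_grid, scoring='f1', refit=False,
                      cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=1))
search.fit(X, y)

# Pair each configuration with its mean fit time, slowest first
timings = sorted(zip(search.cv_results_['mean_fit_time'],
                     search.cv_results_['params']),
                 key=lambda pair: pair[0], reverse=True)
for fit_time, params in timings:
    print('%.4fs  %s' % (fit_time, params))
```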

Thank you!

#Importing libraries
from numpy import mean, std
import sklearn.metrics as metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, KFold, RepeatedStratifiedKFold, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_combined, Y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)

#Setting up the outer cross validation procedure
cv_outer = KFold(n_splits=2, shuffle=True, random_state=1)
outer_results_f1 = list()

for train_ix, test_ix in cv_outer.split(X_combined):
    #Split data
    X_train, X_test = X_combined[train_ix, :], X_combined[test_ix, :]
    y_train, y_test = Y[train_ix], Y[test_ix]

    #Setting up the inner cross validation procedure
    cv_inner = RepeatedStratifiedKFold(n_splits=2, n_repeats=1, random_state=1)

    #Initializing the model
    model = LogisticRegression(solver='liblinear', random_state=1, max_iter=3000)
    
    #Building a pipeline that includes scaling the data
    
    #NOTE - THIS IS WHERE I THINK THE PROBLEM IS OCCURRING
    #The entire procedure runs quickly if I take out the "scaling" part
    lr_pipe = Pipeline([('scaler',  StandardScaler()), ('LR', model)])
        
    param_set = [{'LR__penalty': ['l1','l2'],
                  'LR__C': [100000, 10000, 1000, 100, 10, 1]}]

    #Define and conduct the grid search
    search = GridSearchCV(lr_pipe, param_set, scoring='f1', cv=cv_inner, refit=True)
    
    #Run the grid search on the outer training fold; with refit=True the best configuration is refit on that whole fold
    result = search.fit(X_train, y_train)
    best_model = result.best_estimator_

    #Evaluate model on the hold out dataset
    yhat = best_model.predict(X_test) 

    #Calculating the F1 score
    f1 = f1_score(y_test, yhat)
    outer_results_f1.append(f1)    
    print('>F1=%.3f, est=%.3f, cfg=%s' % (f1, result.best_score_, result.best_params_))
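For reference, the same nested procedure can be written more compactly by passing the `GridSearchCV` object to `cross_val_score`, which runs the outer loop itself. This is a sketch on the synthetic data, not a fix for the runtime problem, and unlike the explicit loop above it does not report the best parameters chosen in each outer fold.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, KFold,
                                     RepeatedStratifiedKFold, cross_val_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=1,
                           n_informative=10, n_redundant=10)

# Same splitters as in the explicit loop
cv_outer = KFold(n_splits=2, shuffle=True, random_state=1)
cv_inner = RepeatedStratifiedKFold(n_splits=2, n_repeats=1, random_state=1)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('LR', LogisticRegression(solver='liblinear', random_state=1,
                                           max_iter=3000))])
param_grid = {'LR__penalty': ['l1', 'l2'],
              'LR__C': [100000, 10000, 1000, 100, 10, 1]}

# Inner loop: hyperparameter search; outer loop: performance estimate
search = GridSearchCV(pipe, param_grid, scoring='f1', cv=cv_inner, refit=True)
scores = cross_val_score(search, X, y, scoring='f1', cv=cv_outer)

print('F1: %.3f (%.3f)' % (mean(scores), std(scores)))
```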
Danby