I want to produce learning curves for three regression models trained on data containing 200 samples, 10 features and 1 target variable.
The target variable contains two clusters/peaks and is imbalanced both between and within the clusters, so I applied a stratified split to divide the data into training and test sets, using train_test_split, in the following manner:
# Stratified split of the dataset into training and test sets:
# bin the continuous target so train_test_split can stratify on the bin labels
import numpy as np
from sklearn.model_selection import train_test_split
bins = np.linspace(0, 1.01, 10)
y_binned = np.digitize(Y_scaled, bins)
X_train, X_test, Y_train, Y_test = train_test_split(
    X_scaled, Y_scaled, stratify=y_binned, test_size=0.3, random_state=0)
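As a quick sanity check (a minimal sketch reusing the names above; it re-bins each subset with the same bin edges), the bin proportions of the two sets can be compared:
# Sanity check: the bin proportions should be similar in both sets
train_counts = np.bincount(np.digitize(Y_train, bins).ravel(), minlength=len(bins) + 1)
test_counts = np.bincount(np.digitize(Y_test, bins).ravel(), minlength=len(bins) + 1)
print("train bin proportions:", train_counts / train_counts.sum())
print("test bin proportions: ", test_counts / test_counts.sum())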
I then trained and tested each model, using stratified k-fold cross-validation for hyperparameter optimisation.
I would now like to use learning curves to verify that my models are not overfitting and that the training and test scores have converged. For this I am turning to sklearn.model_selection.learning_curve.
The documentation for sklearn's learning_curve says the following:
Learning curve.
Determines cross-validated training and test scores for different training set sizes.
A cross-validation generator splits the whole dataset k times in training and test data. Subsets of the training set with varying sizes will be used to train the estimator and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
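For reference, the bare API looks like this (a minimal sketch with a plain, non-stratified KFold and an arbitrary Ridge estimator, just to illustrate the mechanics the documentation describes):
# Minimal sketch of learning_curve itself: KFold splits the data k times, and
# the estimator is refit on growing subsets of each training fold
import numpy as np
from sklearn.model_selection import learning_curve, KFold
from sklearn.linear_model import Ridge
train_sizes, train_scores, test_scores = learning_curve(
    Ridge(), X_train, Y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=KFold(n_splits=5), scoring="r2", n_jobs=-1)
print(train_scores.mean(axis=1), test_scores.mean(axis=1))  # averaged over the k folds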
I implemented the learning curve using the function defined in the Plotting Learning Curves example (with an additional 'scorer' argument) and the following code:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# random_state only has an effect when shuffle=True; recent scikit-learn
# versions reject the combination shuffle=False, random_state=0
cv = StratifiedKFold(n_splits=5, shuffle=False)

for title, estimator in [
        ("Learning Curves (Linear Regression)", model_1),
        ("Learning Curves (Ridge Regression)", model_2),
        ("Learning Curves (Random Forest - Extra Trees)", model_3)]:
    plot_learning_curve(estimator, title, r2_score, X_train_0, Y_train_0,
                        train_sizes=np.linspace(0.1, 1.0, 10),
                        ylim=(0.1, 1.01), cv=cv, n_jobs=-1)
plt.show()
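One thing I am unsure about: if the extra scorer argument is forwarded to learning_curve's scoring parameter (an assumption about my helper function), then a bare metric such as r2_score would need to be wrapped, since scoring expects a string or a scorer object:
# Assumption: plot_learning_curve forwards `scorer` to learning_curve(scoring=...);
# scoring expects a scorer with signature (estimator, X, y) -> float, not a bare metric
from sklearn.metrics import make_scorer, r2_score
scorer = make_scorer(r2_score)  # equivalent to passing scoring="r2"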
According to the code:
train_sizes=np.linspace(0.1, 1.0, 10)
and
cv = StratifiedKFold(n_splits=5, shuffle=False)
I expected that the training dataset (X_train_0) would be split into 10 subsets, that each subset would be stratified to match the whole (as it was during the train/test split), and that each of the ten subsets would then be split into stratified training/test sets, with the training portion undergoing stratified 5-fold cross-validation.
I passed the training data X_train_0 and targets Y_train_0 to the learning-curve function, expecting to obtain mean training and test scores for each subset, but instead received the following error:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
At this point, I can't figure out how to pass a continuous target to the learning curve's estimator while still producing stratified subsets of the training dataset.
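The one direction I have sketched (I am not sure it is the intended approach) relies on the fact that learning_curve's cv parameter also accepts an iterable of (train, test) index pairs, so the splits can be stratified on binned targets while the estimator itself still sees the continuous y:
# Sketch of a possible workaround: compute stratified splits on binned targets,
# then pass the precomputed index pairs to learning_curve via cv, so the
# estimator is fit and scored on the continuous target
import numpy as np
from sklearn.model_selection import learning_curve, StratifiedKFold
bins = np.linspace(0, 1.01, 10)
y_binned_train = np.digitize(Y_train_0, bins).ravel()
splits = list(StratifiedKFold(n_splits=5).split(X_train_0, y_binned_train))
train_sizes, train_scores, test_scores = learning_curve(
    model_1, X_train_0, Y_train_0,  # continuous target passed to the estimator
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=splits, scoring="r2", n_jobs=-1)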