I want to produce learning curves for three regression models trained on data containing 200 samples, 10 features and 1 target variable.
The target variable contains two clusters/peaks and is imbalanced both between and within the clusters, so I applied a stratified split to divide the data into training and test sets, using train_test_split, in the following manner:
# Stratified split of the dataset into training and test sets:
# bin the continuous target so train_test_split can stratify on the bin labels
import numpy as np
from sklearn.model_selection import train_test_split
bins = np.linspace(0, 1.01, 10)
y_binned = np.digitize(Y_scaled, bins)
X_train, X_test, Y_train, Y_test = train_test_split(
    X_scaled, Y_scaled, stratify=y_binned, test_size=0.3, random_state=0)
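As a quick sanity check (a minimal sketch reusing the names above; it re-bins each subset with the same bin edges), the bin proportions of the two sets can be compared:
# Sanity check: the bin proportions should be similar in both sets
train_counts = np.bincount(np.digitize(Y_train, bins).ravel(), minlength=len(bins) + 1)
test_counts = np.bincount(np.digitize(Y_test, bins).ravel(), minlength=len(bins) + 1)
print("train bin proportions:", train_counts / train_counts.sum())
print("test bin proportions: ", test_counts / test_counts.sum())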
I then trained and tested each model, using stratified k-fold cross-validation for hyperparameter optimisation.
I would now like to use learning curves to verify that my models are not overfitting and that the training and test scores have converged. For this I am turning to sklearn.model_selection.learning_curve.
The documentation for sklearn's learning_curve says the following:
Learning curve.
Determines cross-validated training and test scores for different training set sizes.
A cross-validation generator splits the whole dataset k times in training and test data. Subsets of the training set with varying sizes will be used to train the estimator and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
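For reference, the bare API looks like this (a minimal sketch with a plain, non-stratified KFold and an arbitrary Ridge estimator, just to illustrate the mechanics the documentation describes):
# Minimal sketch of learning_curve itself: KFold splits the data k times, and
# the estimator is refit on growing subsets of each training fold
import numpy as np
from sklearn.model_selection import learning_curve, KFold
from sklearn.linear_model import Ridge
train_sizes, train_scores, test_scores = learning_curve(
    Ridge(), X_train, Y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=KFold(n_splits=5), scoring="r2", n_jobs=-1)
print(train_scores.mean(axis=1), test_scores.mean(axis=1))  # averaged over the k folds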
I implemented the learning curve using the function defined in the Plotting Learning Curves example (with an additional 'scorer' argument) and the following code:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# random_state only has an effect when shuffle=True; recent scikit-learn
# versions reject the combination shuffle=False, random_state=0
cv = StratifiedKFold(n_splits=5, shuffle=False)

for title, estimator in [
        ("Learning Curves (Linear Regression)", model_1),
        ("Learning Curves (Ridge Regression)", model_2),
        ("Learning Curves (Random Forest - Extra Trees)", model_3)]:
    plot_learning_curve(estimator, title, r2_score, X_train_0, Y_train_0,
                        train_sizes=np.linspace(0.1, 1.0, 10),
                        ylim=(0.1, 1.01), cv=cv, n_jobs=-1)
plt.show()
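One thing I am unsure about: if the extra scorer argument is forwarded to learning_curve's scoring parameter (an assumption about my helper function), then a bare metric such as r2_score would need to be wrapped, since scoring expects a string or a scorer object:
# Assumption: plot_learning_curve forwards `scorer` to learning_curve(scoring=...);
# scoring expects a scorer with signature (estimator, X, y) -> float, not a bare metric
from sklearn.metrics import make_scorer, r2_score
scorer = make_scorer(r2_score)  # equivalent to passing scoring="r2"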
According to the code:
train_sizes=np.linspace(0.1, 1.0, 10)
and
cv = StratifiedKFold(n_splits=5, shuffle=False)
I expected that the training dataset (X_train_0) would be split into 10 subsets, that each subset would be stratified to match the whole (as it was during the train/test split), and that each of the ten subsets would then be split into stratified training/test sets, with the training portion undergoing stratified 5-fold cross-validation.
I passed the training data X_train_0 and targets Y_train_0 to the learning-curve function, expecting to obtain mean training and test scores for each subset, but instead received the following error:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
At this point, I can't figure out how to pass a continuous target to the learning curve's estimator while still producing stratified subsets of the training dataset.
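The one direction I have sketched (I am not sure it is the intended approach) relies on the fact that learning_curve's cv parameter also accepts an iterable of (train, test) index pairs, so the splits can be stratified on binned targets while the estimator itself still sees the continuous y:
# Sketch of a possible workaround: compute stratified splits on binned targets,
# then pass the precomputed index pairs to learning_curve via cv, so the
# estimator is fit and scored on the continuous target
import numpy as np
from sklearn.model_selection import learning_curve, StratifiedKFold
bins = np.linspace(0, 1.01, 10)
y_binned_train = np.digitize(Y_train_0, bins).ravel()
splits = list(StratifiedKFold(n_splits=5).split(X_train_0, y_binned_train))
train_sizes, train_scores, test_scores = learning_curve(
    model_1, X_train_0, Y_train_0,  # continuous target passed to the estimator
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=splits, scoring="r2", n_jobs=-1)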