For sklearn ML algorithms, is it possible to use boolean data alongside continuous data for the predictive data, and if so how can the data be scaled?

Question

I have a medium size data set (7K) of patient age, sex, and pre-existing conditions. Age of course is from 0-101, sex is 1 for male, 2 for female, and -1 for diverse. All the pre-conditions are Boolean. The outcome, death is also Boolean.

Regardless of how I scale the data (I tried lots of scalers), I always get a warning:

FitFailedWarning: Estimator fit failed. The score on this train-test partition for 
                  these parameters will be set to nan.

This traces back to: ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'unknown'

If I take out the age and sex columns, the error goes away. There are definitely no text, missing, or weird values here.

If I look at my rescaled data, it looks as I would expect it to look.

If I drastically simplify the data, it works.

import numpy as np
import pandas as pd
from sklearn import preprocessing

array = np.array([[42, 1, False, False, False, False, False, False, False, False, False, False, False],\
        [72, 1, False, False, True, False, False, False, False, False, False, True, False],\
        [77, 2, False, False, False, False, False, False, False, False, True, True, False],\
        [36, 1, False, False, False, False, False, False, False, False, False, False, False],\
        [42, 1, False, False, False, False, False, False, False, False, True, False, False],\
        [82, 1, False, False, False, True, False, False, False, False, False, True, False],\
        [71, 2, False, False, False, False, False, False, False, False, False, True, False],\
        [36, -1, False, False, False, False, False, False, False, False, True, False, False],
        [52, 1, False, False, False, False, False, False, False, False, False, False, False],\
        [52, 1, False, False, False, False, False, False, False, True, False, True, True],\
        [77, 2, False, False, False, False, False, False, True, False, True, True, False],\
        [46, 1, False, False, False, False, False, False, False, False, False, False, False],\
        [45, 1, False, False, False, False, False, False, False, False, False, False, False],\
        [88, 1, False, False, False, False, False, True, False, False, False, True, True],\
        [79, 2, False, True, True, False, False, False, False, False, False, True, True],\
        [36, -1, True, False, False, False, False, False, False, False, False, False, False]])

X = array[:,0:12]
Y = array[:,12]
scaler = preprocessing.MinMaxScaler().fit(X)
rescaledX = scaler.transform(X)
    
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

kfold = KFold(n_splits=3, shuffle=True, random_state=7)   # split the data into training and test sets for k-fold validation
model = LogisticRegression(solver='lbfgs')             # set up model of a linear regression
results = cross_val_score(model, rescaledX, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

It would be awesome if someone has an idea of what might be wrong, or how to troubleshoot further.

The value error is saying your _labels_, i.e. y-values, are errant. Does it work without `cross_val_score`? If so, set `error_score="raise"` in `cross_val_score` so you get the full traceback; if not, you can get the traceback for the model itself without cross-validation. In either case, edit the full error traceback into the question, and say some more about your y-values. — Ben Reiniger, Sep 16 '21 at 16:22

score 1 · Accepted Answer · answered Sep 16 '21 at 14:50

So, to me what you have to do is :

Transform all your your True/False to 1/0, so they're numerical. Keep age as it is (or use some normalisation, but not that necessary
Absolutely change the way Sex is handled. You have a big bias since you have 3 values : Since it's numerical, distance matters. Here, distance between "Male" and "Diverse" is 2, and distance between "Female" and "Diverse" is 3. There's no logical reason, seeing your problem, for that. This will bring bias to your model.

You should read this answer : https://datascience.stackexchange.com/a/79575/101580 In your case One Hot Encoder is good enough since you have 3 values.

Yeah, so the key was to really convert my booleans into 0s or 1s. I thought Python considered them as 0s or 1s anyway. I thought they were 1s or 0s with MaxAbsScaler. OneHotEncoder might work on the sex issue, but had iteration issues, I think with the age data. Interation issues also if not normalized. So I'm converting bools to 1s and 0s and using MaxAbsScaler - thanks. — DrWhat, Sep 17 '21 at 11:01

For sklearn ML algorithms, is it possible to use boolean data alongside continuous data for the predictive data, and if so how can the data be scaled?

1 Answers1