
I am working on a binary classification problem for seizure detection. The full dataset has the shapes dataset_X = (154182, 32, 9, 19) and dataset_y = (154182, 1), and I split it into Training, Validation and Test sets.

The unique values and counts for dataset_y are array([0, 1]), array([77127, 77055]), so the classes are balanced. The data is then split into 92508, 30837 and 30837 samples for Training, Validation and Testing respectively.
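For reference, those sizes correspond roughly to a 60/20/20 split. A stratified split along the following lines would produce them (illustrative only; I am not certain this matches the exact sampling used):

    from sklearn.model_selection import train_test_split

    # Illustrative 60/20/20 stratified split; random_state is arbitrary
    X_train, X_tmp, Y_train, Y_tmp = train_test_split(
        dataset_X, dataset_y, test_size=0.4, stratify=dataset_y, random_state=42)
    X_val, X_test, Y_val, Y_test = train_test_split(
        X_tmp, Y_tmp, test_size=0.5, stratify=Y_tmp, random_state=42)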

The configuration using categorical_crossentropy with a final Dense layer of size 2 and a softmax activation works very well. However, when I try binary_crossentropy with a final Dense layer of size 1 and a sigmoid activation, training and validation report almost the same results, but the predictions on the test dataset are totally messed up.
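To make the comparison concrete, the only intended difference between the two configurations is the output head (a sketch; everything upstream of drop2 is identical):

    # Softmax head: 2 units + categorical_crossentropy, with one-hot labels
    out_softmax = Dense(2, activation='softmax')(drop2)

    # Sigmoid head: 1 unit + binary_crossentropy, with integer 0/1 labels
    out_sigmoid = Dense(1, activation='sigmoid')(drop2)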

For the softmax model:

The Model:

import numpy as np
import keras
from keras.layers import (Input, Lambda, BatchNormalization, Convolution2D,
                          Convolution3D, Activation, MaxPooling2D, Flatten,
                          Dropout, Dense)
from keras.models import Model
from keras.optimizers import Adam
from keras.utils import np_utils

def create_cnn_model(X_train_shape, nb_classes):
    inputs = Input(shape=X_train_shape[1:])

    normal1 = BatchNormalization(axis=-1)(inputs)
    # Add a trailing channel dimension so the 3D convolution sees 5D input
    reshape1 = Lambda(lambda x: keras.backend.expand_dims(x, axis=-1))(normal1)
    conv1 = Convolution3D(
        32, (3, 3, X_train_shape[-1]), data_format='channels_last',
        padding='valid', strides=(1, 1, 1))(reshape1)

    # The kernel spans the full last axis, so squeeze the resulting singleton dimension
    reshape2 = Lambda(lambda x: keras.backend.squeeze(x, axis=-2))(conv1)

    relu1 = Activation('relu')(reshape2)
    pool1 = MaxPooling2D(pool_size=(2, 1), data_format='channels_last')(relu1)

    normal2 = BatchNormalization(axis=-1)(pool1)

    conv2 = Convolution2D(
        64, (3, 3), data_format='channels_last',
        padding='valid', strides=(1, 1))(normal2)
    relu2 = Activation('relu')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 1), data_format='channels_last')(relu2)

    normal3 = BatchNormalization(axis=-1)(pool2)

    conv3 = Convolution2D(
        64, (3, 3), data_format='channels_last',
        padding='valid', strides=(1, 1))(normal3)
    relu3 = Activation('relu')(conv3)

    flat = Flatten()(relu3)
    drop1 = Dropout(0.5)(flat)
    dens1 = Dense(256, activation='relu')(drop1)
    drop2 = Dropout(0.5)(dens1)
    dens2 = Dense(nb_classes)(drop2)

    last = Activation('softmax')(dens2)

    model = Model(inputs=inputs, outputs=last)
    return model

The code that creates the model and initiates the training:

        cnn_model = create_cnn_model(X_train.shape, nb_classes)
        adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        cnn_model.compile(loss='categorical_crossentropy', 
                    optimizer=adam, 
                    metrics=['accuracy', 'Recall', 'Precision', 'AUC'])
        Y_train = Y_train.astype('uint8')
        Y_train = np_utils.to_categorical(Y_train, nb_classes)
        Y_val = np_utils.to_categorical(Y_val, nb_classes)


        cnn_model.fit(X_train, Y_train, batch_size=32, epochs=10, validation_data=(X_val,Y_val))

        predictions = cnn_model.predict(X_test, verbose=1)
        # Convert softmax probabilities to one-hot predictions, and the
        # integer test labels to one-hot, so both have shape (N, 2)
        y_pred = np_utils.to_categorical(np.argmax(predictions, axis=1), nb_classes)
        y_true = np_utils.to_categorical(Y_test, nb_classes)

        # Converting one-hot back to class indices for the counts below
        y_pred_s = y_pred.argmax(1)
        y_true_s = y_true.argmax(1)
        
        print(np.unique(y_pred_s, return_counts=True))
        print(np.unique(y_true_s, return_counts=True))
        
        print(y_pred.shape, y_true.shape)
        from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score, roc_auc_score
        acc_scr = accuracy_score(y_true, y_pred)
        pre_scr = precision_score(y_true, y_pred, average='micro')
        rec_scr = recall_score(y_true, y_pred, average='micro')
        roc_auc = roc_auc_score(y_true, y_pred, average='micro')  # avoid shadowing the imported function

        f1_test = f1_score(y_true, y_pred, average='weighted')
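(Side note: with one-hot binary labels, average='micro' precision and recall reduce to plain accuracy, and the micro-averaged ROC AUC on hard 0/1 predictions does too in this binary one-hot setting, which is why all four test scores below are identical. A quick sanity check:)

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    t = np.array([[1, 0], [0, 1], [1, 0]])  # one-hot true labels
    p = np.array([[1, 0], [1, 0], [1, 0]])  # one-hot predictions
    print(accuracy_score(t, p))                    # 0.666...
    print(precision_score(t, p, average='micro'))  # 0.666...
    print(recall_score(t, p, average='micro'))     # 0.666...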

The training details and testing results after 10 epochs:

Shape: x_train, y_train, X_val, y_val
(92508, 32, 9, 19) (92508, 2) (30837, 32, 9, 19) (30837, 2)
Epoch 1/10
2891/2891 [==============================] - 63s 19ms/step - loss: 0.8718 - accuracy: 0.8860 - recall: 0.8860 - precision: 0.8860 - auc: 0.9474 - val_loss: 0.1635 - val_accuracy: 0.9414 - val_recall: 0.9414 - val_precision: 0.9414 - val_auc: 0.9824
Epoch 2/10
2891/2891 [==============================] - 53s 18ms/step - loss: 0.3728 - accuracy: 0.9361 - recall: 0.9361 - precision: 0.9361 - auc: 0.9813 - val_loss: 0.1891 - val_accuracy: 0.9251 - val_recall: 0.9251 - val_precision: 0.9251 - val_auc: 0.9791
...
Epoch 10/10
2891/2891 [==============================] - 48s 17ms/step - loss: 0.1377 - accuracy: 0.9774 - recall: 0.9774 - precision: 0.9774 - auc: 0.9967 - val_loss: 0.0354 - val_accuracy: 0.9864 - val_recall: 0.9864 - val_precision: 0.9864 - val_auc: 0.9986
964/964 [==============================] - 3s 3ms/step
Shape: X_test, y_test, y_pred
(30837, 32, 9, 19) (30837, 2) (30837, 2)
Accuracy:  0.9854719979245712
Recall:  0.9854719979245712
Precision:  0.9854719979245712
ROC AUC:  0.9854719979245712

For the sigmoid model:

The Model: It is the same model as above but with the following changes:

    dens2 = Dense(1)(drop2)
    last = Activation('sigmoid')(dens2)

The code that creates the model and initiates the training:

        cnn_model = create_cnn_model(X_train.shape, nb_classes)  # nb_classes is unused here
        adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        cnn_model.compile(loss='binary_crossentropy', 
                    optimizer=adam, 
                    metrics=['accuracy', 'Recall', 'Precision','AUC'])
        cnn_model.fit(X_train, Y_train, batch_size=32, epochs=10, validation_data=(X_val,Y_val))

        predictions = cnn_model.predict(X_test, verbose=1)
        y_pred = np.argmax(predictions, axis=1)
        y_true = Y_test

        print(y_pred.shape, y_true.shape)
        from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score, roc_auc_score
        acc_scr = accuracy_score(y_true, y_pred)
        pre_scr = precision_score(y_true, y_pred)
        rec_scr = recall_score(y_true, y_pred)
        roc_auc = roc_auc_score(y_true, y_pred)  # avoid shadowing the imported function
        f1_test = f1_score(y_true, y_pred, average='weighted')

The training details and testing results after 10 epochs:

Shape: x_train, y_train, X_val, y_val
(92508, 32, 9, 19) (92508, 1) (30837, 32, 9, 19) (30837, 1)
Epoch 1/10
2891/2891 [==============================] - 80s 24ms/step - loss: 0.0284 - accuracy: 0.9920 - recall: 0.2655 - precision: 0.5381 - auc: 0.9277 - val_loss: 0.0156 - val_accuracy: 0.9955 - val_recall: 0.5370 - val_precision: 0.8734 - val_auc: 0.9432
Epoch 2/10
2891/2891 [==============================] - 60s 21ms/step - loss: 0.0129 - accuracy: 0.9959 - recall: 0.6269 - precision: 0.8476 - auc: 0.9800 - val_loss: 0.0079 - val_accuracy: 0.9974 - val_recall: 0.7860 - val_precision: 0.8899 - val_auc: 0.9873
...
Epoch 10/10
2891/2891 [==============================] - 50s 17ms/step - loss: 0.0853 - accuracy: 0.9660 - recall: 0.9665 - precision: 0.9655 - auc: 0.9952 - val_loss: 0.0865 - val_accuracy: 0.9648 - val_recall: 0.9615 - val_precision: 0.9679 - val_auc: 0.9949
964/964 [==============================] - 3s 3ms/step
Shape: X_test, y_test, y_pred
(30837, 32, 9, 19) (30837, 1) (30837,)
Accuracy:  0.5002432143204592
Recall:  0.0
Precision:  0.0
ROC AUC:  0.5
F1-weighted score: 0.33360360651524557

When printing the unique values and counts of y_true and y_pred for the softmax model, after converting from one-hot back to class indices, I get:

y_true:(array([0, 1], dtype=int64), array([15426, 15411], dtype=int64))

y_pred: (array([0, 1], dtype=int64), array([15360, 15477], dtype=int64))

However, when I run the same for the sigmoid model, I get:

y_true: (array([0, 1], dtype=uint8), array([15426, 15411], dtype=int64))

y_pred: (array([0], dtype=int64), array([30837], dtype=int64))

It is apparent that not a single '1' label is predicted, which explains the scores above. So what causes this behavior, and how can I fix it?
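(For reference, a minimal check of why this happens: np.argmax along axis 1 of an (N, 1) array can only ever return 0, regardless of the sigmoid probabilities themselves:)

    import numpy as np

    # Each row has exactly one column, so the index of its maximum is always 0
    preds = np.array([[0.1], [0.9], [0.4]])
    print(np.argmax(preds, axis=1))  # -> [0 0 0]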

Thank you

  • Have you had a look at the `predictions` variable after running the `predictions = cnn_model.predict(X_test, verbose=1)` step? What shape is this? What are the first few values? – Lynn Nov 17 '22 at 04:34
  • @Lynn Thank you for your reply. The first and last 3 values in `predictions` for the **sigmoid** model are `[[1.7531709e-04] [3.0053352e-04] [2.3599964e-04] ... [9.9439640e-04] [2.8151396e-04] [1.7901099e-01]]` – Mohammed Nafie Nov 17 '22 at 16:04
  • For the **softmax** model, the `predictions` shape is: `(30837, 2)` with values `[[9.99934316e-01 6.57076234e-05] [9.99946475e-01 5.35781728e-05] [9.99929309e-01 7.07275758e-05] ... [9.99987125e-01 1.28526008e-05] [9.99988794e-01 1.12417965e-05] [9.99265611e-01 7.34335219e-04]]` – Mohammed Nafie Nov 17 '22 at 16:21
  • So the shape of the sigmoid `predictions` is (30837, 1) and then you apply `argmax` to it? – Lynn Nov 17 '22 at 22:11
  • Yes I am doing this. I believe this will always yield a '0' no matter what. How is it possible to fix this? Perhaps 2 sigmoid output neurons? Or the main question is how to map the prediction from sigmoid to binary decisions? – Mohammed Nafie Nov 18 '22 at 13:01
  • The usual method is to pick a threshold and assign 1 to values above the threshold and 0 to values below the threshold. Choose a threshold of 0.5 if there's no good reason to use anything else, then simply round the values (see the sketch after these comments). – Lynn Nov 18 '22 at 23:07
  • 0.5 as a threshold won't work, as all the values in the `prediction` vector are less than 0.5. In this case, on what basis should I set the required threshold? – Mohammed Nafie Nov 20 '22 at 22:04
  • To me it seems like there is something about the way the sigmoid model is training that doesn't look right. Compared to the softmax model, the sigmoid model is getting a very low loss and high accuracy on the training data after training for just one epoch. If I had this problem, I'd start by looking at the training data and check (1) that the model training data, including the labels, are correct and (2) the predictions the model makes on the training data make sense, and do the same for the validation data. – Lynn Nov 22 '22 at 11:20
  • Thank you for your help Lynn. I will surely revise all the training and validation data along with the labels. However, I am curious why this would have such an effect on training with sigmoid and not softmax, given that the training/val data and labeling are the same for both (except that I am unaware of which samples are picked during the train/val/test split). And what led you to suggest doing (1) and (2)? – Mohammed Nafie Nov 22 '22 at 23:19
  • I suggested starting off by looking at your training/validation data because something doesn't look right in the training output for the sigmoid model. The loss increases between epoch 1 and 10, the accuracy decreases, but the recall, precision, and AUC all increase. This is very different from the softmax model, which follows what you'd expect to see - the loss decreasing and the accuracy increasing. So looking at your training data, and the model predictions for the training data might help you work out why this is happening. – Lynn Nov 23 '22 at 11:02
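A minimal sketch of the thresholding Lynn describes, assuming `predictions` has shape (30837, 1) as in the sigmoid run above:

    import numpy as np

    # Map sigmoid probabilities to hard 0/1 labels with a 0.5 threshold,
    # instead of argmax over a single column (which always yields 0)
    y_pred = (predictions.ravel() >= 0.5).astype(int)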
