
For the same Binary Image Classification task, if in the final layer I use 1 node with the Sigmoid activation function and the binary_crossentropy loss function, then training goes pretty smoothly (92% accuracy on validation data after 3 epochs).

However, if I change the final layer to 2 nodes and use the Softmax activation function with the sparse_categorical_crossentropy loss function, then the model doesn't seem to learn at all and gets stuck at 55% accuracy (the proportion of the negative class).

Is this difference in performance normal? I thought that for a binary classification task, Sigmoid with Binary Crossentropy and Softmax with Sparse Categorical Crossentropy should produce similar, if not identical, results. Or did I do something wrong?

Note: I use the Adam optimizer, and there is a single label column containing 0s and 1s.
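
Mathematically, a 2-way softmax over logits [0, z] reduces to sigmoid(z) for the positive class, which is why I expected the two setups to match. A minimal NumPy sketch of that identity (the specific logit value is just an illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift by max for numerical stability
    return e / e.sum()

z = 1.7                                  # an arbitrary logit
print(sigmoid(z))                        # ~0.8455
print(softmax(np.array([0.0, z]))[1])    # same value: P(class 1)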

Edit: Code for the 2 cases

Case 1: Sigmoid with binary_crossentropy

# Imports assumed by the snippets below (tf.keras API)
from tensorflow.keras import layers, callbacks, optimizers
from tensorflow.keras.models import Model

def addTopModelMobilNetV1(bottom_model, num_classes):
    top_model = bottom_model.output
    top_model = layers.GlobalAveragePooling2D()(top_model)
    top_model = layers.Dense(1024, activation='relu')(top_model)
    top_model = layers.Dense(1024, activation='relu')(top_model)
    top_model = layers.Dense(512, activation='relu')(top_model)
    top_model = layers.Dense(1, activation='sigmoid')(top_model)
    
    return top_model

fc_head = addTopModelMobilNetV1(mobilnet_model, num_classes)
model = Model(inputs=mobilnet_model.input, outputs=fc_head)
# print(model.summary())

earlystopping_cb = callbacks.EarlyStopping(patience=3, restore_best_weights=True)
model.compile(loss='binary_crossentropy', optimizer=optimizers.Adam(), metrics=['accuracy'])
history = model.fit_generator(generator=train_generator, 
                              steps_per_epoch=train_df.shape[0]//TRAIN_BATCH_SIZE, 
                              validation_data = val_generator,
                              epochs = 10,
                              callbacks = [earlystopping_cb]
                              )
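
Note that fit_generator is deprecated in recent TensorFlow/Keras releases; a sketch of the equivalent call with model.fit, assuming the same generator and callback objects:

# model.fit accepts generators directly in TF 2.x (same arguments as above)
history = model.fit(train_generator,
                    steps_per_epoch=train_df.shape[0] // TRAIN_BATCH_SIZE,
                    validation_data=val_generator,
                    epochs=10,
                    callbacks=[earlystopping_cb])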

Case 2: Softmax with sparse_categorical_crossentropy

def addTopModelMobilNetV1(bottom_model, num_classes):
    top_model = bottom_model.output
    top_model = layers.GlobalAveragePooling2D()(top_model)
    top_model = layers.Dense(1024, activation='relu')(top_model)
    top_model = layers.Dense(1024, activation='relu')(top_model)
    top_model = layers.Dense(512, activation='relu')(top_model)
    top_model = layers.Dense(2, activation='softmax')(top_model)
    
    return top_model

fc_head = addTopModelMobilNetV1(mobilnet_model, num_classes)
model = Model(inputs=mobilnet_model.input, outputs=fc_head)

earlystopping_cb = callbacks.EarlyStopping(patience=3, restore_best_weights=True)

model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizers.Adam(), metrics=['accuracy'])

history = model.fit_generator(generator=train_generator, 
                              steps_per_epoch=train_df.shape[0]//TRAIN_BATCH_SIZE, 
                              validation_data = val_generator,
                              epochs = 10,
                              callbacks = [earlystopping_cb]
                              )
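
Since sparse_categorical_crossentropy expects integer class indices (here 0 or 1) as labels, one thing worth checking is what the generator actually yields. A quick inspection sketch, assuming train_generator is a standard Keras image iterator that supports next():

# Pull one batch and inspect the labels' shape, dtype, and values
xb, yb = next(train_generator)
print(xb.shape)                      # e.g. (batch_size, H, W, 3)
print(yb.shape, yb.dtype, yb[:10])   # should be class indices 0/1, shape (batch_size,)
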
  • Why should these different activation functions give similar results? – Nikos M. Jun 28 '21 at 19:32
  • I think you should read the answers on this page thoroughly; believe me, you will find the answer there: https://stats.stackexchange.com/questions/233658/softmax-vs-sigmoid-function-in-logistic-classifier – Jun 28 '21 at 19:37
  • @NikosM. Because the softmax function is an extension of sigmoid that works for any number of classes >= 2 and not just 2. For binary classification (2 classes), they are the same. – Eric Cartman Jun 28 '21 at 19:46
  • @Hamzah I checked out the link and it does confirm my confusion since for 2 classes softmax and sigmoid are identical. Did I use the softmax activation incorrectly somehow? – Eric Cartman Jun 28 '21 at 19:47
  • Can you elaborate how you get the predicted class when using 2 final nodes with softmax? – Nikos M. Jun 29 '21 at 08:24
  • Even better share the code – Nikos M. Jun 29 '21 at 19:01
  • @NikosM. I added the code. I just tried to fit it and look at the training result. I've also tried to test a random image with the model and it outputs the correct format, for example [0.01, 0.99]. – Eric Cartman Jun 29 '21 at 19:23
  • Yes I see you use a pretrained model and add layers on top. Seems strange.. – Nikos M. Jun 29 '21 at 19:41
  • @NikosM. "top" actually refers to the output layers, so I didn't include top and add a custom top to match my dataset's number of classes, be able to change activation function, etc. I thought this is standard for transfer learning? – Eric Cartman Jun 29 '21 at 20:27

1 Answer


It depends on whether the output classes are mutually exclusive or not. For example, in a multi-label classification problem we use a separate sigmoid for each output, because the task is treated as multiple independent binary classification problems.

If the output classes are mutually exclusive, the usual choice is softmax, because it gives a probability for each class and the probabilities sum to 1. For instance, if the image is a dog, the output might be 90% dog and 10% cat.

In binary classification with a single output node, there is nothing for that output to be mutually exclusive with, so the sigmoid function is the standard choice.

You can find a summary here: https://stackoverflow.com/a/55936594/16310106

  • The output of Binary classification should be mutually exclusive no? It can only be 0 or 1 and not both at the same time. I think you're confusing this with multi-label classification (where you need to use sigmoid instead of softmax since the outputs are not mutually exclusive). I understand we can use Sigmoid for binary classification, but why can't we use the Softmax activation function for binary classification? Mathematically it should work right? – Eric Cartman Jun 28 '21 at 20:37
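
For two mutually exclusive classes, the two formulations do compute the same loss on equivalent predictions; a minimal sketch (assuming TensorFlow 2.x) comparing them directly:

import numpy as np
import tensorflow as tf

y_true = np.array([0, 1, 1, 0])          # integer labels
p = np.array([0.1, 0.8, 0.6, 0.3])       # sigmoid head: P(class 1) per sample
p2 = np.stack([1.0 - p, p], axis=-1)     # equivalent 2-way softmax output

bce = tf.keras.losses.BinaryCrossentropy()(y_true, p)
scce = tf.keras.losses.SparseCategoricalCrossentropy()(y_true, p2)
print(float(bce), float(scce))           # the two loss values match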