
I'm running the Neural Network example from the BogoToBogo tutorial.

The program worked fine:

(array([0, 0]), array([  2.55772644e-08]))
(array([0, 1]), array([ 0.99649732]))
(array([1, 0]), array([ 0.99677086]))
(array([1, 1]), array([-0.00028738]))

The neural network learned XOR using tanh as the activation function (the default). However, I then changed the activation function to "sigmoid":

nn = NeuralNetwork([2,2,1], 'sigmoid')

Now the program outputs:

epochs: 0
...
epochs: 90000
(array([0, 0]), array([ 0.45784467]))
(array([0, 1]), array([ 0.48245772]))
(array([1, 0]), array([ 0.47365194]))
(array([1, 1]), array([ 0.48966856]))

The outputs for the 4 inputs are all near 0.5. This shows that the neural network (with the sigmoid function) didn't learn XOR.

I was expecting the program to output:

  • ~0 for (0, 0) and (1, 1)
  • ~1 for (0, 1) and (1, 0)

Can somebody explain why this example doesn't learn XOR when sigmoid is used?

suztomo
  • What did it work with? – DuttaA Aug 29 '18 at 04:25
  • Without any modification, it output (array([0, 0]), array([ 2.55772644e-08])) (array([0, 1]), array([ 0.99649732])) (array([1, 0]), array([ 0.99677086])) (array([1, 1]), array([-0.00028738])). I updated the question to include what worked fine. – suztomo Aug 29 '18 at 04:26
  • What is the architecture of your net? 2-2-1? – DuttaA Aug 29 '18 at 07:32
  • Could you try running the program multiple times (remove any fixed RNG seed if it has one). Does it always get stuck, or maybe just some of the time? I am asking because with the simplest implementations, it is actually fairly common for this problem to get stuck, and not necessarily a fault in your code. – Neil Slater Aug 29 '18 at 08:13
  • It always gets stuck if it's sigmoid. – suztomo Aug 29 '18 at 10:28
  • The architecture of the network is 2-2-1 – suztomo Aug 29 '18 at 10:34
  • @NeilSlater but can OP achieve an accuracy of more than 3/4? – DuttaA Aug 30 '18 at 04:06
  • Well, apparently I gave the wrong answer here. I found a link for your trouble: https://datascience.stackexchange.com/questions/11589/creating-neural-net-for-xor-function – DuttaA Aug 30 '18 at 08:17
  • @NeilSlater pardon me for being a skeptic, but can you tell me with what combination of weights you can actually achieve that? All I can see is that 2 decision boundaries are required and sigmoid can only supply 1... What am I missing here? – DuttaA Aug 30 '18 at 08:37
  • @DuttaA: The possibility of handling multiple decision boundaries has very little to do with the choice of activation function. A large number of non-linear functions will do just fine for that, including sigmoid. This is related to the [universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem), and XOR is a commonly-used example to demonstrate it. All you need is at least one hidden layer with enough neurons in it. Possibly what you are missing mathematically is using *bias*? – Neil Slater Aug 30 '18 at 08:43
  • @NeilSlater as far as I see this is a classification problem with 0 as the threshold for sigmoid... Now we need 3 regions, or 2 decision boundaries, to separate them. This is probably achieved by the hidden layer, but if we have a single node in the final layer, how will it be able to differentiate between the 3 regions? To me it seems impossible. – DuttaA Aug 30 '18 at 08:49
  • @DuttaA: It is definitely possible. If it helps, you can think of the hidden layer as *mapping* the 3-part space into a 2-part one that can be separated linearly (you can plot the outputs of the hidden layer, and visualise this - it actually moves the points of XOR around so inputs that classify as 1s and 0s can be separated by a single line in the plane). I suggest you read about the universal approximation theorem. – Neil Slater Aug 30 '18 at 09:08

1 Answer

I found the answer myself. The reason for the difference is that BogoToBogo's derivative of tanh (tanh_prime) expects an argument to which the activation function has already been applied:

def tanh_prime(x):
    # x is the post-activation value, i.e. tanh(z)
    return 1.0 - x**2

while sigmoid_prime does not; it calls sigmoid on its argument:

def sigmoid_prime(x):
    # applies sigmoid to x, so x is expected to be the pre-activation value z
    return sigmoid(x)*(1.0-sigmoid(x))

So the definition of sigmoid_prime is actually the more faithful transcription of the math. Then why doesn't sigmoid work? It's because the network calls both derivative functions with values that have already been passed through the activation function.

Background

The derivatives of sigmoid ($\sigma$) and tanh share a convenient property: each can be expressed in terms of the function itself.

$$ \frac{d\tanh(x)}{dx} = 1 - \tanh(x)^2 $$
$$ \frac{d\sigma(x)}{dx} = \sigma(x)\left(1 - \sigma(x)\right) $$
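As a quick sanity check (a standalone snippet of my own, not part of the tutorial's code), both identities can be verified numerically against central finite differences:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
eps = 1e-6

# Numerical derivatives via central differences
num_dtanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
num_dsigma = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# The same derivatives expressed through the activations themselves
a_tanh = np.tanh(z)
a_sigma = sigmoid(z)
print(num_dtanh, 1.0 - a_tanh**2)             # both ~0.6347
print(num_dsigma, a_sigma * (1.0 - a_sigma))  # both ~0.2217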

When performing backpropagation to adjust the weights, a neural network applies the derivative ($g'$) to the pre-activation values. In BogoToBogo's explanation, that's the variable $z^{(2)}$ in

$$ \delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \cdot g'(z^{(2)}). $$

In its source code, the variable dot_value holds these pre-activation values. The Python implementation, however, calls the derivative with the vector stored in the variable a, which holds the post-activation values. Why?

I interpret this as an optimization that exploits the fact that the derivatives of sigmoid and tanh can be computed directly from the activated values. Since the network already holds the post-activation values (as a), it can skip the redundant call to sigmoid or tanh when computing the derivatives. That's why the definition of tanh_prime in BogoToBogo does NOT call tanh internally. The definition of sigmoid_prime, however, calls sigmoid again, so the activation function ends up applied twice and the computed derivative is wrong.
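To see the mismatch concretely, here is a small standalone check of my own (reusing the tutorial's function definitions): when the network stores a = sigmoid(z) and then calls the tutorial's sigmoid_prime(a), it effectively computes $\sigma(\sigma(z))(1 - \sigma(\sigma(z)))$ instead of $\sigma(z)(1 - \sigma(z))$.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(x):  # the tutorial's version: applies sigmoid again
    return sigmoid(x) * (1.0 - sigmoid(x))

z = 2.0
a = sigmoid(z)  # post-activation value the network stores in a

correct = a * (1.0 - a)   # true derivative at z, ~0.1050
buggy = sigmoid_prime(a)  # sigmoid applied twice, ~0.2072
print(correct, buggy)

Note that since a always lies in (0, 1), sigmoid(a) lies roughly in (0.5, 0.73), so the buggy derivative is stuck in a narrow band between about 0.20 and 0.25 regardless of z. The gradient signal barely distinguishes the inputs, which is consistent with all four outputs stalling near 0.5.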

Solution

Once I define sigmoid_prime so that it assumes its argument has already been passed through sigmoid, it works fine:

def sigmoid_prime(x):
    # x is assumed to be the post-activation value, i.e. sigmoid(z)
    return x*(1.0-x)
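A quick way to convince yourself the fix is correct (again a standalone check of my own, not part of the tutorial): compare the corrected derivative, evaluated at the post-activation value, against a finite-difference approximation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(x):  # x is assumed to be sigmoid(z)
    return x * (1.0 - x)

z = np.linspace(-3.0, 3.0, 7)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_prime(sigmoid(z))
print(np.max(np.abs(numeric - analytic)))  # ~1e-10 or smaller: the fix matches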

Then calling the implementation with

nn = NeuralNetwork([2,2,1], 'sigmoid', 500000)

successfully outputs:

(array([0, 0]), array([ 0.00597638]))
(array([0, 1]), array([ 0.99216467]))
(array([1, 0]), array([ 0.99332048]))
(array([1, 1]), array([ 0.00717885]))
suztomo