
I’m new to machine learning and am currently stuck on backpropagation when training the neural network with ReLU activation functions shown in the figure below. My problem is how to update the weight matrices of the hidden and output layers.

The cost function is given as:

$J(\Theta) = \sum\limits_{i=1}^2 \frac{1}{2} \left(a_i^{(3)} - y_i\right)^2$

where $a_i^{(3)}$ is the $i$-th activation of the output layer and $y_i$ is the corresponding target output.

[Figure: the network architecture, with ReLU activation units in the hidden layer; $\Theta^{(2)}$ are the weights from the input layer to the hidden layer and $\Theta^{(3)}$ the weights from the hidden layer to the output layer.]

Using the gradient descent algorithm, the weight matrices can be updated by:

$\Theta_{jk}^{(2)} := \Theta_{jk}^{(2)} - \alpha\frac{\partial J(\Theta)}{\partial \Theta_{jk}^{(2)}}$

$\Theta_{ij}^{(3)} := \Theta_{ij}^{(3)} - \alpha\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(3)}}$
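In code, these two updates are just element-wise operations once the gradients are known. A minimal NumPy sketch (the layer sizes here are arbitrary, and the gradients are left as zero placeholders, since computing them is exactly what I’m asking about):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                          # learning rate

Theta2 = rng.normal(size=(4, 3))     # Theta^(2): input -> hidden (arbitrary 3 inputs, 4 hidden units)
Theta3 = rng.normal(size=(2, 4))     # Theta^(3): hidden -> output (2 outputs, matching the cost above)

# Placeholders: backpropagation should fill these with dJ/dTheta^(2) and dJ/dTheta^(3).
dJ_dTheta2 = np.zeros_like(Theta2)
dJ_dTheta3 = np.zeros_like(Theta3)

Theta2 -= alpha * dJ_dTheta2         # Theta^(2) := Theta^(2) - alpha * dJ/dTheta^(2)
Theta3 -= alpha * dJ_dTheta3         # Theta^(3) := Theta^(3) - alpha * dJ/dTheta^(3)
```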

I understand how to update the weight matrix of the output layer, $\Theta_{ij}^{(3)}$, but I don’t know how to update the one from the input layer to the hidden layer, $\Theta_{jk}^{(2)}$, which involves the ReLU activation units; i.e., I don’t understand how to obtain $\frac{\partial J(\Theta)}{\partial \Theta_{jk}^{(2)}}$.

Can anyone help me understand how to derive this gradient of the cost function?

kelvincheng

2 Answers


Have a look at this post. I found it quite useful when starting out with neural networks.

http://neuralnetworksanddeeplearning.com/chap2.html

RonsenbergVI
  • Thanks for the reference! It gives me a clearer picture, particularly of the calculus part. – kelvincheng Nov 14 '19 at 02:09

The derivative of a ReLU is:

$$\frac{\partial ReLU(x)}{\partial x} = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x > 0 \\ \end{cases} $$

So its value is either 0 or 1. The derivative is not defined at $x = 0$, so a convention is needed to set it to either 0 or 1 at that point.
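For example, a minimal NumPy sketch of this (the helper names `relu` and `relu_grad` are my own), using the common convention of setting the derivative to 0 at $x = 0$:

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Element-wise ReLU derivative: 1 where x > 0, else 0.
    The undefined point x == 0 is mapped to 0 by convention here."""
    return (x > 0).astype(float)

print(relu_grad(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 1.]
```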

To my understanding, this means that the error is either fully propagated to the previous layer (derivative 1) or completely stopped (derivative 0).
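To connect this to the gradient in the question, here is a sketch of the chain rule, assuming the forward pass in the figure is $z_j^{(2)} = \sum_k \Theta_{jk}^{(2)} a_k^{(1)}$, $a_j^{(2)} = \operatorname{ReLU}\!\left(z_j^{(2)}\right)$ and $z_i^{(3)} = \sum_j \Theta_{ij}^{(3)} a_j^{(2)}$ (the exact wiring and bias terms in the figure may differ):

$$\frac{\partial J(\Theta)}{\partial \Theta_{jk}^{(2)}} = \underbrace{\left( \sum_i \frac{\partial J}{\partial z_i^{(3)}} \, \Theta_{ij}^{(3)} \right)}_{\text{error from the output layer}} \cdot \underbrace{\mathbf{1}\!\left[z_j^{(2)} > 0\right]}_{\text{ReLU derivative}} \cdot \underbrace{a_k^{(1)}}_{\text{input activation}}$$

where $\frac{\partial J}{\partial z_i^{(3)}}$ is the same output-layer error term you already use to update $\Theta_{ij}^{(3)}$. The middle factor is exactly the 0/1 derivative above: when $z_j^{(2)} > 0$ the error passes through unchanged, and when $z_j^{(2)} < 0$ the entire gradient for that hidden unit is zero.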

Leevo
  • I understand the special-case derivative of the ReLU function. Sorry my question was unclear: what I want to understand is the calculus of the partial derivative itself, i.e. $\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(x)}}$ for some layer $x$ and indices $i$, $j$. Thanks anyway! – kelvincheng Nov 14 '19 at 02:06
  • Maybe late, but does this mean that if we have a single final neuron that uses ReLU and it predicts 0, there is no "error propagation", since in backprop we always multiply by the derivative of the activation? –  Jun 30 '22 at 16:45