
I’m new to machine learning and am currently stuck on backpropagation when training the neural network with ReLU activation functions shown in the figure below. My problem is how to update the weight matrices of the hidden and output layers.

The cost function is given as:

$J(\Theta) = \sum\limits_{i=1}^2 \frac{1}{2} \left(a_i^{(3)} - y_i\right)^2$

where $a_i^{(3)}$ is the $i$-th activation of the output layer and $y_i$ is the corresponding target output.

[Figure: the network architecture, with ReLU activation units in the hidden layer; $\Theta^{(2)}$ are the weights from the input layer to the hidden layer and $\Theta^{(3)}$ the weights from the hidden layer to the output layer.]

Using the gradient descent algorithm, the weight matrices can be updated by:

$\Theta_{jk}^{(2)} := \Theta_{jk}^{(2)} - \alpha\frac{\partial J(\Theta)}{\partial \Theta_{jk}^{(2)}}$

$\Theta_{ij}^{(3)} := \Theta_{ij}^{(3)} - \alpha\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(3)}}$
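In code, these two updates are just element-wise operations once the gradients are known. A minimal NumPy sketch (the layer sizes here are arbitrary, and the gradients are left as zero placeholders, since computing them is exactly what I’m asking about):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                          # learning rate

Theta2 = rng.normal(size=(4, 3))     # Theta^(2): input -> hidden (arbitrary 3 inputs, 4 hidden units)
Theta3 = rng.normal(size=(2, 4))     # Theta^(3): hidden -> output (2 outputs, matching the cost above)

# Placeholders: backpropagation should fill these with dJ/dTheta^(2) and dJ/dTheta^(3).
dJ_dTheta2 = np.zeros_like(Theta2)
dJ_dTheta3 = np.zeros_like(Theta3)

Theta2 -= alpha * dJ_dTheta2         # Theta^(2) := Theta^(2) - alpha * dJ/dTheta^(2)
Theta3 -= alpha * dJ_dTheta3         # Theta^(3) := Theta^(3) - alpha * dJ/dTheta^(3)
```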

I understand how to update the weight matrix of the output layer, $\Theta_{ij}^{(3)}$, but I don’t know how to update the one from the input layer to the hidden layer, $\Theta_{jk}^{(2)}$, which involves the ReLU activation units; i.e., I don’t understand how to obtain $\frac{\partial J(\Theta)}{\partial \Theta_{jk}^{(2)}}$.

Can anyone help me understand how to derive this gradient of the cost function?

kelvincheng

2 Answers


Have a look at this post. I found it quite useful when starting out with neural networks.

http://neuralnetworksanddeeplearning.com/chap2.html

RonsenbergVI
  • Thanks for the reference! It gives me a clearer picture, particularly of the calculus part. – kelvincheng Nov 14 '19 at 02:09

The derivative of a ReLU is:

$$\frac{\partial ReLU(x)}{\partial x} = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x > 0 \\ \end{cases} $$

So its value is either 0 or 1. The derivative is not defined at $x = 0$, so a convention is needed to set it to either 0 or 1 at that point.
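For example, a minimal NumPy sketch of this (the helper names `relu` and `relu_grad` are my own), using the common convention of setting the derivative to 0 at $x = 0$:

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Element-wise ReLU derivative: 1 where x > 0, else 0.
    The undefined point x == 0 is mapped to 0 by convention here."""
    return (x > 0).astype(float)

print(relu_grad(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 1.]
```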

To my understanding, this means that the error is either fully propagated to the previous layer (derivative 1) or completely stopped (derivative 0).
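To connect this to the gradient in the question, here is a sketch of the chain rule, assuming the forward pass in the figure is $z_j^{(2)} = \sum_k \Theta_{jk}^{(2)} a_k^{(1)}$, $a_j^{(2)} = \operatorname{ReLU}\!\left(z_j^{(2)}\right)$ and $z_i^{(3)} = \sum_j \Theta_{ij}^{(3)} a_j^{(2)}$ (the exact wiring and bias terms in the figure may differ):

$$\frac{\partial J(\Theta)}{\partial \Theta_{jk}^{(2)}} = \underbrace{\left( \sum_i \frac{\partial J}{\partial z_i^{(3)}} \, \Theta_{ij}^{(3)} \right)}_{\text{error from the output layer}} \cdot \underbrace{\mathbf{1}\!\left[z_j^{(2)} > 0\right]}_{\text{ReLU derivative}} \cdot \underbrace{a_k^{(1)}}_{\text{input activation}}$$

where $\frac{\partial J}{\partial z_i^{(3)}}$ is the same output-layer error term you already use to update $\Theta_{ij}^{(3)}$. The middle factor is exactly the 0/1 derivative above: when $z_j^{(2)} > 0$ the error passes through unchanged, and when $z_j^{(2)} < 0$ the entire gradient for that hidden unit is zero.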

Leevo
  • I understand the special-case derivative of the ReLU function. Sorry my question was unclear: what I want to understand is the calculus of the partial derivative itself, i.e. $\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(x)}}$ for some layer $x$ and indices $i$, $j$. Thanks anyway! – kelvincheng Nov 14 '19 at 02:06
  • Maybe late, but does this mean that if we have a single final neuron that uses ReLU and it predicts 0, there is no "error propagation", since in backprop we always multiply by the derivative of the activation? –  Jun 30 '22 at 16:45