The paper I read is Glorot and Bengio (2010), and the math is in Section 4.2.1. Formulas (5) and (10) make sense to me, but I cannot derive formulas (6) and (7) myself from (2) and (3).

Many tutorials I found online use the formula $$Var[XY] = Var[X]Var[Y] + (E[X])^2 Var[Y] + Var[X](E[Y])^2$$ which requires X and Y to be independent.
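To make the independence concern concrete, here is a minimal numerical sketch of that identity. Everything in it is an assumption chosen for illustration and not taken from the paper: the Gaussian distributions, their means and standard deviations, the sample size, and the hypothetical function name `compare_product_variance`. It estimates both sides of the formula once with independent samples and once with the fully dependent choice Y = X.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    # Population variance (divide by n), fine for a large-sample check.
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def compare_product_variance(dependent, n=200_000, seed=0):
    """Return (Var[XY] estimated directly, value given by the product formula).

    Assumed setup (illustrative only): X ~ N(1, 2^2), and either
    Y ~ N(-0.5, 1.5^2) drawn independently, or Y = X when dependent=True.
    """
    rng = random.Random(seed)
    x = [rng.gauss(1.0, 2.0) for _ in range(n)]
    if dependent:
        y = list(x)  # Y is a deterministic function of X
    else:
        y = [rng.gauss(-0.5, 1.5) for _ in range(n)]

    xy = [a * b for a, b in zip(x, y)]
    direct = variance(xy)
    formula = (
        variance(x) * variance(y)
        + mean(x) ** 2 * variance(y)
        + variance(x) * mean(y) ** 2
    )
    return direct, formula


if __name__ == "__main__":
    print("independent:", compare_product_variance(dependent=False))
    print("dependent:  ", compare_product_variance(dependent=True))
```

In the independent case the two numbers should agree up to sampling noise, while in the dependent case they should differ clearly, which is why applying the identity to formulas (2) and (3) seems to need some justification.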

But in formulas (2) and (3), the gradients are not independent of W and Z, because all of these quantities depend on each other through the output of the last layer.

I would appreciate it if anyone could give me a derivation of formulas (6) and (7). Thanks in advance.

Brian Spiering
Jason