I am using TensorFlow to write simple neural networks for a bit of research, and I have had many problems with 'nan' weights while training. I tried many different fixes, such as changing the optimizer, the loss, and the data size, but to no avail. Finally, I noticed that changing the learning rate made an unbelievable difference in my weights.
With a learning rate of .001 (which I thought was pretty conservative), the minimize function actually raised the loss exponentially. Within one epoch the loss could jump from a number in the thousands to a trillion, then to infinity, and finally to 'nan'. When I lowered the learning rate to .0001, everything worked fine.
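To make the setup concrete, here is a stripped-down sketch of the kind of graph I am training. The architecture, loss, optimizer class, and random data below are placeholders, not my actual research code, but the behaviour is the same pattern I described above:

```python
import numpy as np
import tensorflow as tf  # TF 1.x style, since I call optimizer.minimize() directly

# Placeholder inputs/targets (my real data is fed the same way).
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])

# A small fully connected network (layer sizes are illustrative).
w1 = tf.Variable(tf.random_normal([10, 32]))
b1 = tf.Variable(tf.zeros([32]))
hidden = tf.nn.relu(tf.matmul(x, w1) + b1)

w2 = tf.Variable(tf.random_normal([32, 1]))
b2 = tf.Variable(tf.zeros([1]))
pred = tf.matmul(hidden, w2) + b2

loss = tf.reduce_mean(tf.square(pred - y))

# With learning_rate=0.001 the loss explodes to inf/nan within one epoch;
# with learning_rate=0.0001 it decreases as expected.
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        batch_x = np.random.rand(64, 10).astype(np.float32)  # stand-in for my data
        batch_y = np.random.rand(64, 1).astype(np.float32)
        _, cur_loss = sess.run([train_step, loss],
                               feed_dict={x: batch_x, y: batch_y})
```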
1) Why does a single order of magnitude difference in the learning rate have such an effect?
2) Why does the minimize function literally do the opposite of its name and maximize the loss? It seems to me that this shouldn't happen, no matter the learning rate. (My mental model of the update is sketched below.)
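For reference, this is the mental picture I have of what minimize() does on each step (a toy 1-D example, not my real code), which is why the behaviour above surprises me:

```python
# My (possibly naive) mental model of a single minimize() step,
# illustrated on a toy 1-D quadratic loss L(w) = (w - 3)**2:
learning_rate = 0.001
w = 10.0
for step in range(100):
    grad = 2.0 * (w - 3.0)         # dL/dw
    w = w - learning_rate * grad   # move *against* the gradient
    loss = (w - 3.0) ** 2
# Because every step moves against the gradient, I expected the loss to only
# ever decrease, no matter how large learning_rate is.
```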