4

I have trained my ANN on the MNIST dataset. The hidden layer has 128 neurons and the input layer has 784 neurons. This gave me an accuracy of 94%. However, when I added one more hidden layer with 64 neurons, the accuracy dropped significantly, to 35%. What could be the reason behind this?

Edit: Activation function: sigmoid. 521 epochs.
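For reference, a minimal Keras sketch of the setup described above (784 inputs, a 128-neuron hidden layer, the newly added 64-neuron layer, sigmoid activations). The 10-unit softmax output, the SGD optimizer, the learning rate and the shortened epoch count are assumptions for illustration; the question does not specify them.

```python
# Minimal sketch of the architecture described above. The output layer,
# optimizer, learning rate and epoch count are assumptions for illustration.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="sigmoid"),
    tf.keras.layers.Dense(64, activation="sigmoid"),   # the extra layer that hurt accuracy
    tf.keras.layers.Dense(10, activation="softmax"),   # assumed output layer
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```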

Pink

2 Answers

2

The reason is that by adding more layers you have added more trainable parameters to your model, so you have to train it for longer. Also consider that MNIST is a very easy-to-learn dataset: you can use two layers with far fewer neurons in each. Try $10$ neurons per layer to facilitate learning; you can still reach close to $100\%$ accuracy.
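A rough sketch of that suggestion, assuming a Keras implementation (the optimizer, loss and softmax output are illustrative choices the answer does not specify): two hidden layers of 10 neurons each, with `model.summary()` showing how much smaller the parameter count is than with the 128- and 64-neuron layers.

```python
# Sketch of the suggestion above: two small hidden layers of 10 neurons each.
# Keras, the optimizer, the loss and the softmax output are assumptions here.
import tensorflow as tf

small_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
small_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
small_model.summary()   # ~8k trainable parameters vs. ~109k for 784-128-64-10
```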

Green Falcon
0

The problem in your case (as I thought previously) is the sigmoid activation function. It suffers from many problems; of those, your performance decrease is likely due to two:

NOTE: The link provided for 'Vanishing Gradient' explains beautifully why adding layers makes your network more susceptible to saturation of learning.

The vanishing gradient problem ensures that your neural net gets trapped in a non-optimal solution, while a high learning rate ensures that it stays trapped there: after a few oscillations, the high learning rate pushes your network into saturation.
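To make the vanishing-gradient point concrete, here is a small NumPy sketch (the layer widths and weight scale are arbitrary assumptions made only for this illustration): because $\sigma'(z) \le 0.25$, the gradient that reaches the earlier layers shrinks multiplicatively with every extra sigmoid layer.

```python
# Rough numeric illustration of the vanishing-gradient point above.
# Layer widths and the weight scale are arbitrary assumptions for this sketch.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

widths = [784, 128, 64, 64, 10]          # input -> hidden layers -> output
weights = [rng.normal(scale=0.05, size=(m, n)) for m, n in zip(widths, widths[1:])]

# Forward pass, remembering sigmoid'(z) for each layer.
a, derivs = rng.normal(size=(1, 784)), []
for w in weights:
    a = sigmoid(a @ w)
    derivs.append(a * (1 - a))           # sigmoid'(z)

# Backward pass: start with a gradient of ones at the output.
g = np.ones_like(a)
for w, d in zip(reversed(weights), reversed(derivs)):
    g = (g * d) @ w.T                    # chain rule through one sigmoid layer
    print(f"mean |gradient| after backprop through one more layer: {np.abs(g).mean():.2e}")
```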

Solution:

  • The best solution is to use the ReLU activation function, with perhaps the last layer as sigmoid (a sketch follows this list).
  • Use an adaptive optimizer such as AdaGrad, Adam or RMSProp.
  • Alternatively, decrease the learning rate to $10^{-6}$ to $10^{-7}$, but compensate by increasing the number of epochs to $10^6$ to $10^7$.
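A sketch of the first two suggestions combined, assuming a Keras model (the layer sizes mirror the question; the softmax output and the default Adam settings are illustrative assumptions):

```python
# Sketch of the suggested fix: ReLU hidden layers plus an adaptive optimizer.
# Keras, the softmax output layer and the default Adam settings are assumptions.
import tensorflow as tf

relu_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # softmax assumed; the answer suggests sigmoid for the last layer
])
relu_model.compile(optimizer=tf.keras.optimizers.Adam(),  # or RMSprop / Adagrad
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```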
DuttaA
    Has nothing to do with this, it's the size of the network compared to the amount of data available. – Matthieu Brucher Oct 28 '18 at 13:41
  • @MatthieuBrucher what exactly do you mean by size? – DuttaA Oct 28 '18 at 13:43
  • The size of the network (number of layers + number of nodes per layer). – Matthieu Brucher Oct 28 '18 at 13:43
  • @MatthieuBrucher Yes, adding a layer makes it more prone to poor learning via the vanishing gradient; check the vanishing-gradient link. I did not add it to my answer because the answer given was great, but I will indicate it in my answer. – DuttaA Oct 28 '18 at 13:44
  • I know what a vanishing gradient is... You don't know what the OP uses for the training, and the number of epochs is proof that you didn't read the question. – Matthieu Brucher Oct 28 '18 at 13:46
  • @MatthieuBrucher What the OP uses for the training? What do you mean by that? Also, what does the number of epochs indicate? I am at a loss here; can you explain your position more clearly? – DuttaA Oct 28 '18 at 13:48
  • The number of epochs is 521, not 1 million. And you don't know what kind of optimizer is used. If you need to know, ask as a comment first, same for learning rate. – Matthieu Brucher Oct 28 '18 at 13:50
  • @MatthieuBrucher I am pretty sure the learning rate is not of the order I provided. Second, I provided it as a general solution, not one localised to just the OP's problem; nitpicking to downvote is not something I would suggest. Also, the link tells why stacking layers might lead to bad learning; I do not think you checked the link. – DuttaA Oct 28 '18 at 13:53