Conceptual question on CNN and any multi layer neural network (Part 2)

Question

I have read a number of tutorials and online lectures (https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/) but none of them mention the rationale for selecting a particular design. How do we decide on the following design aspects?

1) Is there a rule of thumb for deciding on the number of layers? Or is it purely on the basis of trial and error?

2) Can somebody please explain the intuition and the rationale for designing a CNN architecture for this example -- considering a binary classification problem. For an input RGB image of size 500*500*3, how would you design the architecture -- how many layers, number of filters, size of the filter, how much is the stride, etc.

score 3 · Accepted Answer · answered Sep 19 '18 at 16:22

There's an amazing book called "Deep Learning with Python" by Francois Chollet that you could refer to. To answer your question:

You usually start with a large number of layers and watch for the point where the validation accuracy plateaus while the training loss keeps decreasing. That is where your network starts to overfit, and from there it's easier to tune the network with the help of regularization.
You choose the depth of your network based on the complexity of the problem and the resolution of the images. Bigger pictures generally need larger networks. The number of filters should increase as the network gets deeper, while the size of the feature maps should decrease. I don't know much about strides, though. Max_Pooling is the more common way to downsample, rather than strides.
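To make the feature-map arithmetic concrete, here is a minimal sketch that traces the shapes for the 500*500*3 input from the question. The block layout (3x3 convolution followed by 2x2 max-pooling) and the filter counts are illustrative assumptions, not a recommended design, and the helper names `conv_out` and `trace_shapes` are made up for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(size=500, channels=3, filters=(32, 64, 128, 128)):
    """Return feature-map shapes for repeated [3x3 conv -> 2x2 max-pool] blocks.

    The filter counts are illustrative assumptions, not a recommendation.
    """
    shapes = [(size, size, channels)]
    for n_filters in filters:
        size = conv_out(size, kernel=3)            # 3x3 conv, stride 1, no padding
        size = conv_out(size, kernel=2, stride=2)  # 2x2 max-pool, stride 2
        shapes.append((size, size, n_filters))
    return shapes


for shape in trace_shapes():
    print(shape)
# (500, 500, 3)
# (249, 249, 32)
# (123, 123, 64)
# (60, 60, 128)
# (29, 29, 128)
```

Each block roughly halves the spatial size while the filter count grows. A single 3x3 convolution with stride 2 gives a similar reduction (`conv_out(500, 3, stride=2)` is also 249), which is why strides and pooling are often treated as alternatives for downsampling.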

I could go on about Max_Pooling but that won't answer your question.
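The overfitting check from the first point can be sketched roughly as follows. This is only one possible heuristic: the function name `overfit_epoch`, the `patience` and `min_delta` parameters, and the per-epoch numbers are all hypothetical, chosen for illustration.

```python
def overfit_epoch(train_loss, val_acc, patience=3, min_delta=1e-3):
    """Return the epoch of the best validation accuracy once validation
    accuracy has failed to improve for `patience` epochs while the
    training loss is still lower than it was at that epoch; else None."""
    best_epoch = 0
    for epoch in range(1, len(val_acc)):
        if val_acc[epoch] > val_acc[best_epoch] + min_delta:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # Validation has stalled; confirm training loss kept going down.
            if train_loss[epoch] < train_loss[best_epoch]:
                return best_epoch
    return None


# Hypothetical per-epoch metrics, made up for this example.
train_loss = [0.70, 0.55, 0.42, 0.31, 0.22, 0.15, 0.10, 0.06]
val_acc = [0.60, 0.71, 0.78, 0.80, 0.80, 0.79, 0.80, 0.79]

print(overfit_epoch(train_loss, val_acc))  # 3
```

Past the returned epoch the extra capacity is only being used to fit the training set, so that is roughly where regularization (dropout, weight decay, data augmentation) or a smaller network becomes worth trying.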

Sajid Ahmed · Answer 2 · 2019-04-04T07:03:53.417

I know it may be too late to post an answer, but I'm answering anyway in case it helps someone who stumbles upon this question.

Designing a particular architecture from scratch can be extremely difficult, given the large number of hyperparameters that have to be tuned on a validation set, not to mention the computational resources required to train that many deep models (one for each hyperparameter combination in the worst case).
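To get a feel for how quickly the worst case grows, here is a small sketch that enumerates a full grid. The search space below is a hypothetical example; the hyperparameter names and values are assumptions chosen only to show the combinatorics.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative assumptions.
search_space = {
    "num_conv_blocks": [2, 3, 4, 5],
    "base_filters": [16, 32, 64],
    "kernel_size": [3, 5, 7],
    "stride": [1, 2],
    "dropout": [0.0, 0.25, 0.5],
}


def grid(space):
    """Enumerate every hyperparameter combination as a dict."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]


configs = grid(search_space)
print(len(configs))  # 216 models to train for an exhaustive search
print(configs[0])
```

Even this modest grid means 216 separate training runs, and every extra hyperparameter multiplies that number again.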

The usual workaround for your situation is transfer learning, which for computer vision tasks means starting from a pre-trained model. Since the filters learned on images for one task can usually be applied quite successfully to images for another task as feature extractors, these pre-trained models are the go-to solution. After that, you can fine-tune the model for your task at hand and, if necessary, change the architecture a little from that point onward. These well-established architectures will at least serve as starting points for your design. You can also read the papers in which these architectures were proposed to get a basic intuition for why the different architectural choices were made; those choices can then be combined in rational ways to design novel architectures if necessary.
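As a rough illustration of the feature-extraction step, here is a minimal sketch using Keras. The choice of MobileNetV2, the 224x224 input size (the 500*500 images from the question would be resized first), and the single sigmoid output for binary classification are all assumptions made for this example; the helper name `build_transfer_model` is made up.

```python
def build_transfer_model(input_shape=(224, 224, 3), weights="imagenet"):
    """Frozen pre-trained base plus a small trainable binary-classification head."""
    # Imported here so this file can be loaded without TensorFlow installed.
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights
    )
    base.trainable = False  # feature extraction: keep the pre-trained filters fixed

    inputs = tf.keras.Input(shape=input_shape)
    # MobileNetV2 expects pixel values in [-1, 1]; map [0, 255] to that range.
    x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)  # keep batch-norm layers in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Only the final dense layer is trained at first. For the fine-tuning step, a common approach is to unfreeze some or all of the base afterwards and recompile with a much smaller learning rate, so the pre-trained filters are adjusted gently rather than overwritten.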

You can take a look at this article for further reference.
