1

I'm looking at the architecture proposed in the following paper: Baoguang Shi et al, An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition.

In the proposed architecture of the model, a MaxPooling Window:1 × 2, s:2 layer is mentioned. I'm not sure what the size of the output of this layer would be. Architecture of model

If i have an input of size (32 x 8), then the output would be:

(32-1)/2 + 1 = 16.5, <- this part doesn't make sense to me

(8-2)/2 + 1 = 4

*ignoring depth and batch size here

1 Answers1

0

According to the paper, maybe "s" represents stride in row, while the stride in column equals 1.