I have a dataset with thousands of music score pages and manually annotated bounding boxes for the individual bars:
My objective is now to train a DNN that should ultimately be able to get these bounding boxes on its own. First idea was to use something like the Region Proposal Network (RPN) from Faster R-CNN on top of ResNet or VGG, but I am unsure if this still works because the "objectness" is rather high for almost each section of the page. Plus the regions are mostly touching each other but rarely overlap. Number of bars is roughly somewhere between 1 and 250 per page.
Additionally, the number of systems (=rows of bars) per page is oftentimes not changing between subsequent pages. This might be a very helpful info that RPN would miss. Maybe introduce some sort of recurrency?
Is there anything out there that would be more tailored to my specific problem? Any advise on a better fitting architecture or further tweaks would be highly appreciated.


