I want to train YOLOv3 on a custom dataset whose raw labels are in JSON format. Each bounding box in the JSON is specified as [x1, y1, x2, y2].
So far, I have converted [x1, y1, x2, y2] to [cx, cy, pw, ph], i.e. the center x and center y of the bounding box, scaled by the image width and height; pw and ph are the ratios of the bounding box's width and height to the image's width and height. But I don't think that's complete (or even right).
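For reference, this is roughly the conversion I have so far (a minimal sketch; `img_w` and `img_h` are the image dimensions):

```python
def xyxy_to_cxcywh_normalized(x1, y1, x2, y2, img_w, img_h):
    """Convert corner coordinates to a center + size box, scaled to [0, 1]."""
    cx = (x1 + x2) / 2.0 / img_w   # box center x as a fraction of image width
    cy = (y1 + y2) / 2.0 / img_h   # box center y as a fraction of image height
    pw = (x2 - x1) / img_w         # box width as a fraction of image width
    ph = (y2 - y1) / img_h         # box height as a fraction of image height
    return cx, cy, pw, ph
```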
As far as I understand, YOLOv3 assigns N anchor boxes to each grid cell (the image is divided into an S x S grid), so a bounding box prediction is relative to a particular anchor box of a particular grid cell (the anchor box that has the highest IOU with the ground truth). The formulas from the paper are below:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw * e^tw
bh = ph * e^th

(Note that in these formulas cx, cy denote the top-left offset of the grid cell and pw, ph denote the anchor box's width and height, not the values I computed above.)
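In code, my understanding of those formulas would be something like this (just a sketch; tx, ty, tw, th are the raw network outputs, cx_cell/cy_cell are the grid cell's top-left coordinates in grid units, and anchor_w/anchor_h are the anchor dimensions):

```python
import math

def decode_prediction(tx, ty, tw, th, cx_cell, cy_cell, anchor_w, anchor_h):
    """How I understand raw network outputs map to a box (in grid units)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx_cell      # center x, offset from the cell's top-left corner
    by = sigmoid(ty) + cy_cell      # center y, offset from the cell's top-left corner
    bw = anchor_w * math.exp(tw)    # width scaled from the anchor prior
    bh = anchor_h * math.exp(th)    # height scaled from the anchor prior
    return bx, by, bw, bh
```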
Therefore, how should I prepare the ground truths so that YOLOv3 can understand them? Do I have to somehow reverse those formulas? Also, how do I account for the different numbers of scales and anchor boxes?
As a concrete example: suppose I have a 416 x 416 image and a configuration of 13 x 13 grid cells. The ground truth bounding box (from the dataset) is [x1=100, y1=100, x2=200, y2=200], class = c. What would the converted values for YOLOv3 be?
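To make the numbers concrete, this is what my current conversion gives for that box, plus which grid cell the center falls into (if I understand the grid assignment correctly):

```python
img_size, S = 416, 13
x1, y1, x2, y2 = 100, 100, 200, 200

cx = (x1 + x2) / 2 / img_size   # 150 / 416 ≈ 0.3606
cy = (y1 + y2) / 2 / img_size   # 150 / 416 ≈ 0.3606
pw = (x2 - x1) / img_size       # 100 / 416 ≈ 0.2404
ph = (y2 - y1) / img_size       # 100 / 416 ≈ 0.2404

# grid cell containing the box center (0-indexed)
col = int(cx * S)               # int(4.6875) = 4
row = int(cy * S)               # int(4.6875) = 4
```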
Later edit: say we have 2 classes [car, person] and 2 anchors (one wide, one tall).
Would the output be a tensor of shape 13 x 13 x (2*(5+2)), where that 2*(5+2) vector is all zeros for every grid cell except one particular cell (the one in which the center of the ground truth bounding box falls)?
In this case, for that cell (say c[i,j]), suppose anchor 2 has the largest IOU with the ground truth and the ground truth class is person. This means that c[i,j,:7] (the anchor 1 slot) would be ignored and c[i,j,7:] (the anchor 2 slot) would be [bx, by, bw, bh, conf, 0, 1].
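Just to make my mental model explicit, this is roughly how I would build that target tensor (a sketch under the assumptions above; bx, by, bw, bh are placeholders precisely because their encoding is what I'm asking about):

```python
import numpy as np

S, num_anchors, num_classes = 13, 2, 2
target = np.zeros((S, S, num_anchors * (5 + num_classes)))  # 13 x 13 x 14

i, j = 4, 4          # grid cell containing the box center
best_anchor = 1      # anchor 2 (0-indexed) has the highest IOU with the ground truth

bx, by, bw, bh = 0.0, 0.0, 0.0, 0.0   # placeholders: how to encode these is my question
start = best_anchor * (5 + num_classes)
# [bx, by, bw, bh, conf, p(car), p(person)] written into the matched anchor's slot
target[i, j, start:start + 7] = [bx, by, bw, bh, 1.0, 0.0, 1.0]
```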
So how should the ground truth for the person's bounding box be encoded? Should it be an offset from a particular anchor of a particular grid cell? This is the part that is still unclear to me.
Thank you!
