In the R-CNN paper, the authors define the target values for bounding-box regression as follows.
Given a (proposal box, ground-truth box) pair $(P, G)$, each of the form $(x, y, w, h)$, where $(x, y)$ is the center coordinate of the box and $w, h$ are its width and height:
$t_x = (G_x - P_x) / P_w \hspace{2.0cm} t_y = (G_y - P_y) / P_h$
$t_w = \log(G_w / P_w) \hspace{2.0cm} t_h = \log(G_h / P_h)$
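To make the transform concrete, here is a small sketch (my own helper names, not from the paper) that computes these targets and applies the inverse transform, so a proposal shifted by its own targets recovers the ground-truth box exactly:

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets (t_x, t_y, t_w, t_h) for a proposal P and
    ground-truth G, both given as (x_center, y_center, w, h)."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return np.array([(Gx - Px) / Pw,      # t_x: center shift, scale-normalized
                     (Gy - Py) / Ph,      # t_y
                     np.log(Gw / Pw),     # t_w: log scale ratio
                     np.log(Gh / Ph)])    # t_h

def apply_targets(P, t):
    """Inverse transform: move proposal P by (predicted) offsets t."""
    Px, Py, Pw, Ph = P
    tx, ty, tw, th = t
    return np.array([Px + Pw * tx,
                     Py + Ph * ty,
                     Pw * np.exp(tw),
                     Ph * np.exp(th)])
```

Note the design choice this makes visible: the center offsets are divided by the proposal's width/height (so the target is invariant to box scale), and the size targets are log-ratios (so a prediction of 0 means "keep the size" and the regressed width/height stay positive after `exp`).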
The goal is to find $\textbf{w}_*$, where $*$ is one of $x, y, w, h$, such that
$\textbf{w}_* = \arg \min_{\hat{\textbf{w}}_*} \sum_i (t^i_* - \hat{\textbf{w}}_*^T \phi(P^i))^2 + \lambda \|\hat{\textbf{w}}_*\|^2$, where $\phi(P^i)$ is the feature vector produced by the last pooling layer of the feature extractor when proposal $P^i$ is passed through it.
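For reference, this objective is plain ridge regression, so each $\textbf{w}_*$ has a closed-form solution. A minimal sketch with synthetic stand-ins for $\phi(P^i)$ and $t^i_*$ (the sizes and data here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16                       # toy values: n proposals, d-dim features
Phi = rng.normal(size=(n, d))        # stand-in for the pooled features phi(P^i)
t = Phi @ rng.normal(size=d) + 0.01 * rng.normal(size=n)  # synthetic targets t_*

lam = 1.0                            # the ridge penalty lambda
# Closed-form ridge solution: w = (Phi^T Phi + lam I)^{-1} Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)
```

In R-CNN there are four such independent regressions (one per target), each a linear function of the same frozen features.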
I don't understand why they came up with this approach to bounding-box regression. Can anyone explain the intuition behind it?
P.S.: Since this regression approach is used not only in R-CNN but also in later models, I would really like a clear understanding of it.