
I am studying the math behind SVM. The following question concerns a small but important detail in the SVM derivation.

The question

Why can the distance between the hyperplane $w*x+b=0$ and a data point (in vector form) $p$, $d = \frac{w * p + b}{||w||}$, be simplified to $d = \frac{1}{||w||}$?

My argument

Since the data point $p$ is not on the hyperplane, we have $w*p+b=k$ with $k \ne 0$. Then $d=\frac{k}{||w||}$, but $k$ is not a constant, as it depends on $w$. So optimizing $d=\frac{k}{||w||}$ is not the same as optimizing $d = \frac{1}{||w||}$.

The distance-calculation step appears in the following materials:

  1. page 4 of https://www.ugpti.org/smartse/resources/downloads/support-vector-machines.pdf
  2. slide 10 of http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf
  3. slide 11 of https://www.cs.upc.edu/~mmartin/SVMs.pdf
  • Are you talking about page 5 of the first resource? – kate-melnykova Nov 01 '20 at 07:21
  • @kate-melnykova thanks for your comment. Page 5 shows the steps to calculate the distance, but the assumption of $w*x+b=1$ starts on page 4 (search for "For mathematical convenience"). I don't understand how $\delta = 1$ works for all possible scenarios. – Alan Yue Nov 01 '20 at 07:32
  • I recommend having a look at pages 3-7 of [CS229 Support Vector Machines](https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf) (it's explained at the end of page 6) – Javier TG Nov 01 '20 at 08:23
  • @JavierTG thanks for the lead. However, I followed all the way until near the end of page 6, but got a question when "getting rid of $\Vert w\Vert=1$". How come $\Vert w\Vert$ is still in the objective function $\max_{w,b} \frac{\hat{\gamma}}{\Vert w\Vert}$, since we assumed $\Vert w\Vert=1$? – Alan Yue Nov 01 '20 at 13:11
  • That's because the assumption of $\Vert w\Vert=1$ was made in order to have that $\gamma = \hat{\gamma}$. This way by maximizing the functional margin, $\hat{\gamma}$, we would also maximize the geometric margin, $\gamma$ (this is what we want). So if we instead maximize $\hat{\gamma}/\Vert w\Vert = \gamma$ we don't have to worry about $\Vert w\Vert$ being $=1$ anymore. Hence $\Vert w\Vert=1$ is no longer a constraint. – Javier TG Nov 01 '20 at 13:38
  • Please check here: https://datascience.stackexchange.com/questions/55592/mathematical-formulation-of-support-vector-machines/55595#55595 – Fatemeh Asgarinejad Nov 02 '20 at 19:24

1 Answer


I believe I have found the answer. A great thank-you to @Javier TG for pointing me to Andrew Ng's CS229 notes on Support Vector Machines. I won't write down the full context, as the lecture notes and the references in the question do a much better job than I ever could. I will, however, try to fill in the small logic gap in many notes: why $w*x+b=1$ can represent any $w*x+b=\lambda, \lambda \ne 0$.

The key insight is that the distance $d_i$ from a point $x_i$ to the hyperplane $w*x+b=0$ is invariant to (positive) scaling of $w$ and $b$. Below is a proof.

From the notes, we derive the distance as $$d_i=\frac{y_i(w*x_i+b)}{\|w\|}.$$ This shows that $d_i$ is invariant to scaling of $w, b$: replacing $w, b$ with $2w, 2b$ gives the same $d_i$ (you can verify this yourself; see also the check below). It is then straightforward that the minimum distance over the data points, $$ d=\min_{i=1,\dots,n} d_i, $$ is also invariant to the scaling.
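For completeness, here is the one-line check: replacing $w, b$ with $kw, kb$ for any $k>0$, $$\frac{y_i(kw*x_i+kb)}{\|kw\|}=\frac{k\,y_i(w*x_i+b)}{k\,\|w\|}=d_i.$$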

However, let $l_i$ denote the numerator in the distance equation, $l_i = y_i(w*x_i+b)$. Note that $l_i$ scales at the same rate as $w, b$: replacing $w, b$ with $2w, 2b$ gives $2l_i$ (again easy to verify; see the numeric check below). Then the minimum of the $l_i$, $$ l=\min_{i=1,\dots,n} l_i, $$ scales with $w, b$.
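As a quick sanity check (a small numeric sketch of my own, not from the notes), both scaling behaviors can be verified in a few lines:

```python
import numpy as np

# Scaling (w, b) -> (k*w, k*b) with k > 0 leaves the distance d_i
# unchanged but scales the numerator l_i by k.
rng = np.random.default_rng(0)
w, b = rng.normal(size=3), 0.7        # an arbitrary hyperplane
x_i, y_i = rng.normal(size=3), 1.0    # an arbitrary labeled point
k = 2.0                               # an arbitrary positive scale

d_i  = y_i * (w @ x_i + b) / np.linalg.norm(w)
d_ik = y_i * ((k * w) @ x_i + k * b) / np.linalg.norm(k * w)
l_i  = y_i * (w @ x_i + b)
l_ik = y_i * ((k * w) @ x_i + k * b)

print(np.isclose(d_ik, d_i))      # True: d_i is invariant to the scaling
print(np.isclose(l_ik, k * l_i))  # True: l_i scales by k
```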

Now revisit the optimization problem derived in the notes, $$ \begin{aligned} \max_{w,b} \quad & d\\ \textrm{s.t.} \quad & y_i(w*x_i+b)\geq l, i=1,...,n \end{aligned} $$

It is fine to tweak $w, b$ to find the optimal solution, but $d$ and $l$ also appear in the problem statement, which makes the problem more complex. It would be great if we could get rid of $d$ and $l$. We do this by leveraging the findings above. Since $l$ scales with $w, b$, we can rescale so that $l'=kl=1$, i.e. pick $k=1/l$. To maintain the inequality constraints, we apply the same scaling on the left-hand side: $$y_i(kw*x_i+kb)\geq kl=1.$$ Since $d$ is invariant to the scaling, the objective function is unchanged; we are still solving the same optimization problem.
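Spelled out, with $k=1/l$ (valid because $l>0$ when every point is classified correctly) the constraints become $$y_i\left(\frac{w}{l}*x_i+\frac{b}{l}\right)\geq 1,$$ while the objective keeps its value: $d=\frac{l}{\|w\|}=\frac{kl}{\|kw\|}=\frac{1}{\|kw\|}$.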

Finally, renaming $kw \to w$ and $kb \to b$, so that $d=\frac{1}{\|w\|}$, we arrive at the same optimization problem stated in the notes: $$ \begin{aligned} \max_{w,b} \quad & \frac{1}{\|w\|}\\ \textrm{s.t.} \quad & y_i(w*x_i+b)\geq 1, \quad i=1,\dots,n \end{aligned} $$
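To make the final problem concrete, here is a minimal numeric sketch (my own toy example, not from the notes) that solves it with scipy.optimize on a small 2-D dataset; it minimizes $\|w\|^2/2$, which is equivalent to maximizing $\frac{1}{\|w\|}$:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable 2-D data; labels y_i in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the variables as theta = [w_1, w_2, b].
def objective(theta):
    w = theta[:2]
    return 0.5 * (w @ w)  # minimizing ||w||^2 / 2  <=>  maximizing 1 / ||w||

# One inequality constraint per point: y_i * (w @ x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda th, i=i: y[i] * (th[:2] @ X[i] + th[2]) - 1.0}
    for i in range(len(X))
]

# A feasible starting point; with constraints given, minimize defaults to SLSQP.
res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w))
```

On this toy data the solver should recover roughly $w \approx (0.25, 0.25)$ and $b \approx 0$, i.e. a maximum margin of about $2.83$.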

Now we know that, in our optimization problem, any constraint $y_i(w*x_i+b)\geq l$ is equivalent to $y_i(w*x_i+b)\geq 1$.
