
The convergence of the "simple" perceptron says that:

$$k\leqslant \left ( \frac{R\left \| \bar{\theta} \right \|}{\gamma } \right )^{2}$$

where $k$ is the number of iterations (in which the weights are updated), $R$ is the maximum distance of a sample from the origin, $\bar{\theta}$ is the final weights vector, and $\gamma$ is the smallest distance from $\bar{\theta}$ to a sample (= the margin of hyperplane).
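
For concreteness, here is a minimal sketch (the toy data and the mistake-driven update $\theta \leftarrow \theta + y\,x$ with labels in $\{-1,+1\}$ are my own assumptions, not taken from any particular textbook) that trains a perceptron, counts the updates $k$, and compares them against the bound:

```python
import numpy as np

# Toy linearly separable data (labels in {-1, +1}); purely illustrative.
X = np.array([[ 1.0,  2.0], [ 2.0,  3.0], [ 3.0,  1.0],
              [-1.0, -2.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Put the bias inside theta by augmenting each sample with a constant 1.
Xa = np.hstack([np.ones((len(X), 1)), X])

theta = np.zeros(Xa.shape[1])
k = 0                                   # number of weight updates (mistakes)
changed = True
while changed:
    changed = False
    for xi, yi in zip(Xa, y):
        if yi * (theta @ xi) <= 0:      # mistake: wrong sign
            theta += yi * xi            # standard perceptron update
            k += 1
            changed = True

R = np.max(np.linalg.norm(Xa, axis=1))   # max distance of a sample from the origin
gamma = np.min(y * (Xa @ theta))         # margin of the final weights vector
bound = (R * np.linalg.norm(theta) / gamma) ** 2
print(f"k = {k}, bound = {bound:.2f}")   # the theorem guarantees k <= bound
```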

Many books implicitly say that $\left \| \bar{\theta} \right \|$ is equal to 1. But why do they normalize it?

Shayan Shafiq
Qwerto

2 Answers


The Perceptron's output $f$ is $$ f(\overline\theta \cdot \overline{x}) = \begin{cases} 1 &\text{ if } \overline{\theta }\cdot \overline{x} > 0 \\ 0 &\text{ if } \overline{\theta }\cdot \overline{x} \le 0\end{cases} $$ Here, $\overline{x} = (1, x_1, \dots, x_n)$ where $(x_1, \dots, x_n)$ is the input vector.

You can see that the output only depends on the sign of the product $\overline{\theta }\cdot \overline{x}.$ Therefore the output will not change if we multiply $\overline\theta \cdot \overline{x}$ by a positive constant: $$ f(\overline\theta \cdot \overline{x}) = f(c\times \overline\theta \cdot \overline{x}) \qquad \text{ for all } c > 0. $$ In particular, we may choose $c = 1/\lVert \overline\theta \rVert.$ So we get $ f(\overline{\theta }\cdot \overline{x}) = f(\tilde\theta\cdot \overline{x}), $ where $\tilde\theta = \overline\theta / \lVert \overline\theta \rVert$ is the normalized vector. We conclude that there is no loss of generality in assuming $\lVert \overline\theta \rVert = 1.$
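
A minimal numeric sketch of this invariance (the particular $\overline\theta$ and $\overline{x}$ below are arbitrary made-up values):

```python
import numpy as np

def f(theta, x):
    """Perceptron output: 1 if theta . x > 0, else 0."""
    return 1 if np.dot(theta, x) > 0 else 0

theta = np.array([0.5, -1.2, 3.0])    # arbitrary weights (bias first)
x = np.array([1.0, 0.7, -0.3])        # augmented input (1, x1, x2)

theta_unit = theta / np.linalg.norm(theta)

# Scaling theta by any c > 0 preserves the sign of theta . x, hence the output.
assert f(theta, x) == f(theta_unit, x) == f(10.0 * theta, x)
print(f(theta, x), f(theta_unit, x))   # same class either way
```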

Elias Strehle
  • Thanks so much. Do you also know why the "learning rate" doesn't influence convergence? – Qwerto Feb 16 '18 at 15:59
  • For the same reason. A learning rate would enter the model in the same way as our constant $c > 0.$ – Elias Strehle Feb 16 '18 at 16:17
  • The fact is that, considering the initial inequality of my post, different multiples of $\bar{\theta}$, despite giving the same hyperplane, make the number of iterations $k$ smaller, as the norm of $\bar{\theta}$ becomes different – Qwerto Feb 16 '18 at 16:52

Note that the condition $k\leqslant \left ( \frac{R\left \| \bar{\theta} \right \|}{\gamma } \right )^{2}$ makes sense only if $\gamma$ is dependent on $\bar{\theta}$.
Otherwise (i.e. if $\gamma$ and $\bar{\theta}$ are independent), we could choose $\tilde\theta = \frac{\gamma}{R\lVert \bar\theta \rVert}\cdot \bar\theta$, and as Elias demonstrated, $\tilde\theta$ would remain a valid final weights vector, and we would get $$k \leqslant \left ( \frac{R\left \| \tilde{\theta} \right \|}{\gamma } \right )^{2}=\left ( \frac{R\left \| \frac{\gamma}{R\lVert \bar\theta \rVert}\cdot \bar\theta \right \|}{\gamma } \right )^{2}=1$$ which is obviously not always true.


So I guess that wherever you saw the condition $k\leqslant \left ( \frac{R\left \| \bar{\theta} \right \|}{\gamma } \right )^{2}$, the definition of $\gamma$ was something along these lines:
The margin of the hyperplane defined by $\bar \theta$ (a weights vector that also includes the bias) is the maximum $\gamma>0$ such that:

  • for every positive sample $x$, it holds that $\bar \theta ^ T x \geqslant \gamma$.
  • for every negative sample $x$, it holds that $\bar \theta ^ T x \leqslant -\gamma$.
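
As a small sketch of this definition (the samples and $\bar \theta$ below are made up for illustration):

```python
import numpy as np

# Hypothetical augmented samples (first coordinate is the constant 1 for the bias)
# and a hypothetical weights vector theta_bar describing the hyperplane x1 = 0.
positives = np.array([[1.0,  2.0,  1.0], [1.0,  3.0, -1.0]])
negatives = np.array([[1.0, -1.0,  0.5], [1.0, -2.0, -2.0]])
theta_bar = np.array([0.0, 1.0, 0.0])

# The margin is the largest gamma with theta_bar . x >= gamma on positives and
# theta_bar . x <= -gamma on negatives, i.e. the smallest |theta_bar . x|.
gamma = min(np.min(positives @ theta_bar), np.min(-(negatives @ theta_bar)))
print(gamma)   # 1.0 here: the closest sample has theta_bar . x = +/- 1
```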


Now, for any $c>0$, if we choose $\tilde\theta = c\cdot \bar\theta$, then the margin of the hyperplane defined by $\tilde\theta$ is $\tilde \gamma=c \cdot \gamma$, and thus: $$\frac{R\left \| \bar{\theta} \right \|}{\gamma } = \frac{R\left \| \tilde{\theta} \right \|}{\tilde\gamma } $$
Finally, I guess that people (e.g. wikipedia) usually choose $\bar{\theta}$ such that $\left \| \bar{\theta} \right \|=1$ because then we get the condition $k\leqslant \left ( \frac{R}{\gamma } \right )^{2}$, which looks nicer.



By the way, I was disappointed to find out that choosing $\bar{\theta}$ such that $\left \| \bar{\theta} \right \|=1$ doesn't give the $\gamma$ you would have expected from looking at the samples and the hyperplane in a graph. For example:

[figure: margin example]
The hyperplane in the graph is the line $x=6$, and for $\bar{\theta}= \begin{pmatrix}1\\ 0\\ -6 \end{pmatrix}$ we would get $\gamma=1$, which fits the geometric intuition. However, $\left \| \bar{\theta} \right \|\not=1$ in this case (because of the trick of putting the bias inside $\bar{\theta}$).

For $\bar{\theta}= \begin{pmatrix}\frac{1}{\sqrt{37}}\\ 0\\ -\frac{6}{\sqrt{37}} \end{pmatrix}$ it holds that $\left \| \bar{\theta} \right \|=1$, but $\gamma=\frac{1}{\sqrt{37}}$.
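
A quick check of these two margins (the samples are not shown in the post, so I hypothetically assume the closest positive sample is $(7, 0)$ and the closest negative sample is $(5, 0)$, which reproduces $\gamma = 1$ for the first vector):

```python
import numpy as np

# Assumed closest samples, augmented as (x1, x2, 1) with the bias term last,
# to match theta_bar = (1, 0, -6) for the hyperplane x1 = 6.
x_pos = np.array([7.0, 0.0, 1.0])
x_neg = np.array([5.0, 0.0, 1.0])

theta_bar  = np.array([1.0, 0.0, -6.0])
theta_unit = theta_bar / np.linalg.norm(theta_bar)   # same hyperplane, norm 1

for th in (theta_bar, theta_unit):
    gamma = min(th @ x_pos, -(th @ x_neg))
    print(np.linalg.norm(th), gamma)
# Prints norm sqrt(37) with gamma = 1, then norm 1 with gamma = 1/sqrt(37).
```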

Oren Milman