
Via this MIT document I studied the single-layer perceptron convergence proof (i.e., the bound on the maximum number of update steps).

In the convergence proof inside this document, the learning rate is implicitly set to 1. After studying it, I tried to redo the proof myself, this time inserting a generic learning rate $\eta$.

The result is that the bound remains the same:

$$k \leq \frac{R^{2}\left\|\theta^{*}\right\|^{2}}{\gamma^{2}},$$

that is, the learning rate $\eta$ cancels out in the proof.

Is this possible, or did I make a mistake in my proof?

Qwerto

1 Answer


Your conclusion is correct.

Note that the classifier is of the form of

$$f(x; \theta) = \operatorname{sign}(\theta^Tx)$$

while the update rule is

$$\theta^{(k+1)}=\theta^{(k)}+\eta y_t x_t,$$ which is applied only when the example $(x_t, y_t)$ is misclassified. In the classification we only care about the sign, and

$$\operatorname{sign}(\theta^Tx)=\operatorname{sign}(\eta \theta^Tx)$$

as long as $\eta$ is positive.

Multiplying by $\eta$ only scales $\theta^Tx$, but our concern is just the sign, hence the learning rate is irrelevant.
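As a quick sanity check, here is a minimal sketch with made-up toy data (not code from the MIT document): two perceptrons trained from $\theta^{(0)}=0$ with different positive learning rates make the same sequence of mistakes, and their final weights differ only by the factor $\eta$.

```python
import numpy as np

def perceptron(X, y, eta, max_epochs=100):
    """Perceptron with learning rate eta, starting from the zero vector.

    Returns the final weights and the total number of updates (mistakes).
    """
    theta = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(theta, x_t) <= 0:   # misclassified (or on the boundary)
                theta += eta * y_t * x_t        # update scaled by eta
                mistakes += 1
                clean_pass = False
        if clean_pass:                          # no mistakes in a full pass: converged
            break
    return theta, mistakes

# Toy linearly separable data (made up purely for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

theta_a, k_a = perceptron(X, y, eta=1.0)
theta_b, k_b = perceptron(X, y, eta=0.1)

print(k_a, k_b)                        # same number of mistakes
print(theta_a, theta_b / 0.1)          # weights differ only by the factor eta
```

Because both runs start from the zero vector, the weight vectors stay proportional ($\theta_\eta^{(k)} = \eta\,\theta_1^{(k)}$) throughout training, so the decisions, and hence the mistake counts, coincide.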

Remark: You might like to note the assumptions being used in the proof though. For instance, there is this assumption that $\theta^*$ exists. It is assumed that the data can be perfectly separated by a hyperplane.
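For completeness, here is a sketch of how the $\eta$'s cancel, written in the question's notation under the usual assumptions ($\theta^{(0)}=0$, $\|x_t\| \leq R$, and $y_t\,\theta^{*T} x_t \geq \gamma$ for every $t$); the exact symbols in the MIT document may differ slightly. Since every update adds $\eta y_t x_t$ and only happens when $y_t\,\theta^{(k)T} x_t \leq 0$, after $k$ mistakes

$$\theta^{(k)} \cdot \theta^{*} \geq k\,\eta\,\gamma, \qquad \left\|\theta^{(k)}\right\|^{2} \leq k\,\eta^{2} R^{2}.$$

Combining the two with the Cauchy–Schwarz inequality,

$$k\,\eta\,\gamma \leq \theta^{(k)} \cdot \theta^{*} \leq \left\|\theta^{(k)}\right\|\left\|\theta^{*}\right\| \leq \sqrt{k}\,\eta\,R\left\|\theta^{*}\right\|,$$

so the $\eta$ on both sides cancels and $k \leq R^{2}\|\theta^{*}\|^{2}/\gamma^{2}$, exactly the bound obtained with $\eta = 1$.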

Siong Thye Goh
  • Great help!!! :) P.S.: does the convergence speed also remain the same? That is, the number of steps $k$ with learning rate $1$ vs. learning rate $0.1$, for example. – Qwerto Feb 19 '18 at 17:07
  • (In the proof we have defined the upper bound on $k$.) – Qwerto Feb 19 '18 at 17:20
  • In fact, under the assumption that $\theta^*$ exists, and with the implementation setting the initial value to the $0$ vector, the two sequences differ only by a scaling by the learning rate. – Siong Thye Goh Feb 19 '18 at 19:30