
Hi, the original paper (https://arxiv.org/pdf/1805.08318.pdf) shows the following scheme of self-attention:

[image: self-attention scheme from the original paper]

A later overview (https://arxiv.org/pdf/1906.01529.pdf) shows this scheme, citing the original paper:

[image: self-attention scheme from the overview]

My understanding agrees more with the scheme in the second paper:

[image]

Here there are two dot-product operations and three parameter matrices, $W_k, W_v, W_q$, which correspond to $W_f, W_g, W_h$, without the $W_v$ that appears in the original paper's explanation, which is as follows:

[image: explanation from the original paper]

Is this a mistake in the original paper?
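
To make the comparison concrete, here is a minimal pure-Python sketch of the scheme as I understand it: two dot products (query with key, then attention weights with value) and three parameter matrices. The function and argument names are my own and come from neither paper. The optional `w_o` is a hypothetical stand-in for the extra $W_v$ of the original paper, under the assumption that it is applied after the attention-weighted sum; that assumption is exactly what I am unsure about.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Softmax over each row, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v, w_o=None):
    """x holds N feature vectors of size C, one per row.

    w_q and w_k are C x D, w_v is C x E.
    w_o (hypothetical, E x C) is an assumed extra projection after the weighted sum.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    scores = matmul(q, transpose(k))  # first dot product: N x N
    weights = softmax_rows(scores)    # each row sums to 1
    out = matmul(weights, v)          # second dot product: N x E
    if w_o is not None:
        out = matmul(out, w_o)        # assumed role of the paper's extra matrix
    return out, weights


# Toy usage: three positions, two channels, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
```

With `w_o=None` this is the three-matrix version from the second scheme; passing a fourth matrix gives the variant I believe the original paper's explanation describes.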

Ilya.K.
