
In the papers "Convolutional Sequence to Sequence Learning" and "Attention Is All You Need", position embeddings are simply added to the input word embeddings to give the model a sense of the order of the input sequence. These position embeddings are generated from a sinusoidal signal that depends on the absolute position of the word in the sequence and on the dimension. The position embeddings have the same dimension as the word embeddings, and the two are simply summed.
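
For concreteness, here is a minimal NumPy sketch of the sinusoidal scheme described above (the function name, the toy sequence length and the model dimension are mine, not from either paper): even dimensions use a sine, odd dimensions a cosine, with geometrically spaced frequencies, and the result is simply added to the word embeddings.

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len, d_model):
    """Sinusoidal position embeddings in the style of "Attention Is All You Need".

    Each position gets a d_model-dimensional vector: even dimensions use sin,
    odd dimensions use cos, with frequencies 1 / 10000^(2i / d_model).
    """
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model // 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The model input is just the element-wise sum of the two embeddings;
# word_embeddings here is a random placeholder with shape (seq_len, d_model).
word_embeddings = np.random.randn(10, 512)
model_input = word_embeddings + sinusoidal_position_embeddings(10, 512)
```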

I can understand that this helps the model get a sense of the ordering of the input, but I'm quite disturbed by the fact that adding these two might also erase some of the information contained in the word embeddings. Do you have an explanation of why this might work (or not)? Is there any literature about it?

Robin
    Same question. Why can a randomly initialized matrix be trained to contain the position information? https://github.com/google-research/bert/blob/master/modeling.py#L491-L520 – DunkOnly Jan 03 '19 at 06:12

1 Answer


The token embeddings are not fixed; they are learned. During training, the values learned for the token embeddings are, by construction, ones that remain useful after being summed with the positional embeddings. Because the token embeddings are trained under exactly this condition, nothing gets "erased": the learned vectors are simply ones that are appropriate to combine with the positional embeddings.
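
A minimal PyTorch sketch of the situation the answer describes (class and variable names are mine); this is also the BERT-style setup linked in the comment above, where the position table itself is learned. Both embedding tables receive gradients through the sum, so the token vectors are optimized under the very condition of being added to position vectors.

```python
import torch
import torch.nn as nn

class SummedEmbeddings(nn.Module):
    """Token and position embeddings that are summed; both tables are learned jointly."""

    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned position table

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        # Gradients flow into both tables through this sum, so the token
        # embeddings are learned in the presence of the position embeddings.
        return self.token_emb(token_ids) + self.pos_emb(positions)
```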

noe