
I have a question about the Transformer decoder's feed forward during training.

Let's pick an example: the input is "i love the sun" and the (Italian) translation I want to predict is "io amo il sole".

Now I feed the encoder with the input "i love the sun" and I get the hidden states. Then I have to do multiple feed forwards on the decoder with the input "BOS io amo il", where BOS is a token that stands for beginning of sentence. So I assume I have these feed forwards:

  • [BOS, IO, AMO, IL] -> decoder -> IO
  • [BOS, IO, AMO, IL] -> decoder -> AMO
  • [BOS, IO, AMO, IL] -> decoder -> IL
  • [BOS, IO, AMO, IL] -> decoder -> SOLE

I think this is the correct way, and what differentiates each of these training passes, I think, is the masked attention mechanism (?). Is it right to assume that the masking will be

[1 0 0 0,
 0 0 0 0,
 0 0 0 0,
 0 0 0 0]   for the first feed forward

[1 0 0 0,
 1 1 0 0,
 0 0 0 0,
 0 0 0 0]   for the second feed forward

[1 0 0 0,
 1 1 0 0,
 1 1 1 0,
 0 0 0 0]   for the third feed forward

[1 0 0 0,
 1 1 0 0,
 1 1 1 0,
 1 1 1 1]   for the fourth feed forward

Is this the correct way, or what should be different? If you could also provide a Python implementation it would be useful; thanks in advance.
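
In code, I imagine building these masks roughly like this (a rough NumPy sketch of the matrices above, just to show what I mean; the sizes are hard-coded for this example):

```python
import numpy as np

seq_len = 4  # [BOS, IO, AMO, IL]

# one mask per feed forward: at step t, only the first t positions are active,
# and each active position attends only to itself and earlier positions
masks = []
for t in range(1, seq_len + 1):
    mask = np.zeros((seq_len, seq_len), dtype=int)
    mask[:t, :t] = np.tril(np.ones((t, t), dtype=int))
    masks.append(mask)

for t, mask in enumerate(masks, start=1):
    print(f"mask for feed forward {t}:\n{mask}\n")
```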

erre4

1 Answer

There are some problems with your description:

  • During training, the decoder receives all of the target tokens, shifted right by prepending the BOS token. You removed sole; the actual input would be [<bos>, io, amo, il, sole]. Note that the expected output at the position of sole would be the end-of-sequence token <eos>.

  • During training, there is a single forward pass (not one per token), and all the output tokens are predicted at once. Therefore, only the last of your attention masks (the full lower-triangular one) is used; a sketch of this single training pass is included after this list.

  • During inference, we don't have the target tokens (because they are what we are trying to predict). In this case, we have one pass per generated token, starting with <bos>. This way, the decoder input in the first step would just be the sequence [<bos>], and we would predict the first token: io. Then we would prepare the input for the next timestep as [<bos>, io] and obtain the prediction for the second token, and so on. Note that, at each timestep, we are repeating the computations for the past positions; in real implementations, these states are cached instead of re-computed at each timestep. A rough sketch of this decoding loop is included further below.
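
To make the training part concrete, here is a minimal sketch of that single forward pass, assuming PyTorch and its built-in nn.Transformer (the token ids, vocabulary and model sizes below are made up just for illustration):

```python
import torch
import torch.nn as nn

# toy vocabulary (made-up ids): 0=<bos>, 1=<eos>, 2=i, 3=love, 4=the, 5=sun,
#                               6=io, 7=amo, 8=il, 9=sole
src     = torch.tensor([[2, 3, 4, 5]])        # "i love the sun"
tgt_in  = torch.tensor([[0, 6, 7, 8, 9]])     # <bos> io amo il sole (shifted right)
tgt_out = torch.tensor([[6, 7, 8, 9, 1]])     # io amo il sole <eos> (expected outputs)

d_model, vocab_size = 32, 10
embed    = nn.Embedding(vocab_size, d_model)
model    = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)     # one projection shared by all positions

# causal mask: position i may only attend to positions <= i
# (the pattern of your fourth matrix, expressed as 0 / -inf)
causal_mask = model.generate_square_subsequent_mask(tgt_in.size(1))

# single forward pass: all five output positions are predicted at once
hidden = model(embed(src), embed(tgt_in), tgt_mask=causal_mask)
logits = to_vocab(hidden)                     # shape: (1, 5, vocab_size)

loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()                               # one loss/backward pass for the whole sentence
```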

As for Python code illustrating how the Transformer works, I suggest The Annotated Transformer, which is a nice guide through a real implementation. You may be most interested in the function run_epoch for training and in the function greedy_decode for inference.
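
And as a rough illustration of the inference loop described above (a hand-rolled greedy decoder reusing the toy model from the training sketch, without the state caching mentioned earlier):

```python
# greedy decoding with the toy model above (sketch; recomputes past positions every step)
model.eval()
bos_id, eos_id, max_len = 0, 1, 10

with torch.no_grad():
    memory = model.encoder(embed(src))        # encode "i love the sun" once
    generated = [bos_id]
    for _ in range(max_len):
        tgt = torch.tensor([generated])       # grows by one token per timestep
        causal_mask = model.generate_square_subsequent_mask(tgt.size(1))
        hidden = model.decoder(embed(tgt), memory, tgt_mask=causal_mask)
        next_id = to_vocab(hidden[:, -1]).argmax(dim=-1).item()  # only the last position matters
        generated.append(next_id)
        if next_id == eos_id:                 # stop once <eos> is predicted
            break

print(generated)   # e.g. [0, 6, 7, 8, 9, 1] once the model has actually been trained
```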

noe
  • During inference, should the inputs for the tokens not yet considered (the future ones) all be 0s? – erre4 Mar 05 '21 at 11:12
  • During inference, the future input tokens simply do not exist yet. We don't know how long the sentence will be until we reach the `<eos>` prediction. The sequence length at the first timestep is 1, at the second timestep is 2, and so on, until `<eos>` is predicted. – noe Mar 05 '21 at 11:38
  • Did something in the answer cause its unacceptance? – noe Mar 09 '21 at 16:53
  • Sorry, I think I misclicked on it. But I also have another question: if we have to predict everything at once during training, does that mean there is a feedforward network for each output token at the end of the decoder? – erre4 Mar 10 '21 at 19:38
  • There is a single feedforward network for all of the positions, and it is applied for each of them (at the same time). – noe Mar 10 '21 at 19:45
  • OK, sorry if I'm repetitive: if we have 5 tokens that must be predicted, the output of the decoder is of size (let's assume) 5, and the vocabulary is made of a total of 5 words, do we then have a feedforward network with 5 different softmaxes and 5*5*5 total weights? – erre4 Mar 10 '21 at 20:24
  • No, the feedforward networks in the Transformer are composed of two linear layers. The first one projects its input into a higher-dimensional space, then a ReLU is applied, and then the second one projects back into the original dimensionality. You can check the code [here](https://nlp.seas.harvard.edu/2018/04/03/attention.html#position-wise-feed-forward-networks); see also the sketch after these comments. – noe Mar 10 '21 at 20:39
  • I'm not talking about the feedforward network inside the decoder, but about the final layer outside the Transformer, with the softmax function used for the prediction. – erre4 Mar 10 '21 at 20:43
  • Oh, I got it from here: https://stats.stackexchange.com/questions/392213/understand-the-output-layer-of-transformer. Basically my misunderstanding was about how the linear + softmax outside the decoder handles all these outputs, since you said everything must be predicted at once during training. It seems there is one linear layer of size (d, N-words) that applies the multiplication + softmax to each token output of the decoder, and it is shared across the tokens because it can be seen as an embedding matrix. – erre4 Mar 10 '21 at 21:14
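
To tie the last few comments together, here is a small sketch (again assuming PyTorch, with made-up sizes) of the two pieces being discussed: the position-wise feed-forward block inside each layer, and the single linear + softmax output layer that is shared across all positions:

```python
import torch
import torch.nn as nn

d_model, d_ff, vocab_size, seq_len = 512, 2048, 5, 5

# position-wise feed-forward block inside each decoder layer:
# two linear layers with a ReLU in between, applied independently to every position
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),    # project up to the larger dimensionality
    nn.ReLU(),
    nn.Linear(d_ff, d_model),    # project back down to d_model
)

# final output layer: a single (d_model x vocab_size) matrix shared by all positions
to_vocab = nn.Linear(d_model, vocab_size)

decoder_output = torch.randn(1, seq_len, d_model)   # pretend this came out of the decoder

ff_out = feed_forward(decoder_output)    # same FFN weights applied to all seq_len positions at once
logits = to_vocab(decoder_output)        # (1, seq_len, vocab_size); no 5*5*5 weight tensor
probs  = torch.softmax(logits, dim=-1)   # one softmax per position, same shared weights
```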