
I have a question about the Transformer decoder's feed forward during training.

Let's pick an example: the input is "i love the sun" and the (Italian) translation I want to predict is "io amo il sole".

Now I feed the encoder with the input "i love the sun" and I get the hidden states. Then I have to do multiple feed forwards on the decoder with the input "BOS io amo il", where BOS is a token that stands for beginning of sentence. So I assume I have these feed forwards:

  • [BOS, IO, AMO, IL] -> decoder -> IO
  • [BOS, IO, AMO, IL] -> decoder -> AMO
  • [BOS, IO, AMO, IL] -> decoder -> IL
  • [BOS, IO, AMO, IL] -> decoder -> SOLE

I think this is the correct way, and what differentiates each of these training passes, I think, is the masked attention mechanism (?). Is it right to assume that the masking will be

[1 0 0 0,
 0 0 0 0,
 0 0 0 0,
 0 0 0 0]   for the first feed forward

[1 0 0 0,
 1 1 0 0,
 0 0 0 0,
 0 0 0 0]   for the second feed forward

[1 0 0 0,
 1 1 0 0,
 1 1 1 0,
 0 0 0 0]   for the third feed forward

[1 0 0 0,
 1 1 0 0,
 1 1 1 0,
 1 1 1 1]   for the fourth feed forward

Is this the correct way, or what should be different? If you could also provide a Python implementation it would be useful; thanks in advance.
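
In code, I imagine building these masks roughly like this (a rough NumPy sketch of the matrices above, just to show what I mean; the sizes are hard-coded for this example):

```python
import numpy as np

seq_len = 4  # [BOS, IO, AMO, IL]

# one mask per feed forward: at step t, only the first t positions are active,
# and each active position attends only to itself and earlier positions
masks = []
for t in range(1, seq_len + 1):
    mask = np.zeros((seq_len, seq_len), dtype=int)
    mask[:t, :t] = np.tril(np.ones((t, t), dtype=int))
    masks.append(mask)

for t, mask in enumerate(masks, start=1):
    print(f"mask for feed forward {t}:\n{mask}\n")
```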

erre4

1 Answer

There are some problems with your description:

  • During training, the decoder receives all of the target tokens, shifted right by prepending the BOS token. You removed sole; the actual input would be [<bos>, io, amo, il, sole]. Note that the expected output at the position of sole would be the end-of-sequence token <eos>.

  • During training, there is a single forward pass (not one per token), and all the output tokens are predicted at once. Therefore, only the last of your attention masks (the full lower-triangular one) is used; a sketch of this single training pass is included after this list.

  • During inference, we don't have the target tokens (because they are what we are trying to predict). In this case, we have one pass per generated token, starting with <bos>. This way, the decoder input in the first step would just be the sequence [<bos>], and we would predict the first token: io. Then we would prepare the input for the next timestep as [<bos>, io] and obtain the prediction for the second token, and so on. Note that, at each timestep, we are repeating the computations for the past positions; in real implementations, these states are cached instead of re-computed at each timestep. A rough sketch of this decoding loop is included further below.
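
To make the training part concrete, here is a minimal sketch of that single forward pass, assuming PyTorch and its built-in nn.Transformer (the token ids, vocabulary and model sizes below are made up just for illustration):

```python
import torch
import torch.nn as nn

# toy vocabulary (made-up ids): 0=<bos>, 1=<eos>, 2=i, 3=love, 4=the, 5=sun,
#                               6=io, 7=amo, 8=il, 9=sole
src     = torch.tensor([[2, 3, 4, 5]])        # "i love the sun"
tgt_in  = torch.tensor([[0, 6, 7, 8, 9]])     # <bos> io amo il sole (shifted right)
tgt_out = torch.tensor([[6, 7, 8, 9, 1]])     # io amo il sole <eos> (expected outputs)

d_model, vocab_size = 32, 10
embed    = nn.Embedding(vocab_size, d_model)
model    = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)     # one projection shared by all positions

# causal mask: position i may only attend to positions <= i
# (the pattern of your fourth matrix, expressed as 0 / -inf)
causal_mask = model.generate_square_subsequent_mask(tgt_in.size(1))

# single forward pass: all five output positions are predicted at once
hidden = model(embed(src), embed(tgt_in), tgt_mask=causal_mask)
logits = to_vocab(hidden)                     # shape: (1, 5, vocab_size)

loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()                               # one loss/backward pass for the whole sentence
```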

As for Python code illustrating how the Transformer works, I suggest The Annotated Transformer, which is a nice guide through a real implementation. You may be most interested in the function run_epoch for training and in the function greedy_decode for inference.
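
And as a rough illustration of the inference loop described above (a hand-rolled greedy decoder reusing the toy model from the training sketch, without the state caching mentioned earlier):

```python
# greedy decoding with the toy model above (sketch; recomputes past positions every step)
model.eval()
bos_id, eos_id, max_len = 0, 1, 10

with torch.no_grad():
    memory = model.encoder(embed(src))        # encode "i love the sun" once
    generated = [bos_id]
    for _ in range(max_len):
        tgt = torch.tensor([generated])       # grows by one token per timestep
        causal_mask = model.generate_square_subsequent_mask(tgt.size(1))
        hidden = model.decoder(embed(tgt), memory, tgt_mask=causal_mask)
        next_id = to_vocab(hidden[:, -1]).argmax(dim=-1).item()  # only the last position matters
        generated.append(next_id)
        if next_id == eos_id:                 # stop once <eos> is predicted
            break

print(generated)   # e.g. [0, 6, 7, 8, 9, 1] once the model has actually been trained
```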

noe
  • During inference, should the inputs for the tokens not yet considered (the future ones) all be 0s? – erre4 Mar 05 '21 at 11:12
  • During inference, the future input tokens simply do not exist yet. We don't know how long the sentence will be until we reach the `<eos>` prediction. The sequence length at the first timestep is 1, at the second timestep is 2, and so on, until `<eos>` is predicted. – noe Mar 05 '21 at 11:38
  • Did something in the answer cause its unacceptance? – noe Mar 09 '21 at 16:53
  • Sorry, I think I misclicked on it. But I also have another question: if we have to predict everything at once during training, does that mean there is a feedforward network for each output token at the end of the decoder? – erre4 Mar 10 '21 at 19:38
  • There is a single feedforward network for all of the positions, and it is applied for each of them (at the same time). – noe Mar 10 '21 at 19:45
  • OK, sorry if I'm repetitive: if we have 5 tokens that must be predicted, the output of the decoder is of size (let's assume) 5, and the vocabulary is made of a total of 5 words, do we then have a feedforward network with 5 different softmaxes and 5*5*5 total weights? – erre4 Mar 10 '21 at 20:24
  • No, the feedforward networks in the Transformer are composed of two linear layers. The first one projects its input into a higher-dimensional space, then a ReLU is applied, and then the second one projects back into the original dimensionality. You can check the code [here](https://nlp.seas.harvard.edu/2018/04/03/attention.html#position-wise-feed-forward-networks); see also the sketch after these comments. – noe Mar 10 '21 at 20:39
  • I'm not talking about the feedforward network inside the decoder, but about the final layer outside the Transformer, with the softmax function used for the prediction. – erre4 Mar 10 '21 at 20:43
  • Oh, I got it from here: https://stats.stackexchange.com/questions/392213/understand-the-output-layer-of-transformer. Basically my misunderstanding was about how the linear + softmax outside the decoder handles all these outputs, since you said everything must be predicted at once during training. It seems there is one linear layer of size (d, N-words) that applies the multiplication + softmax to each token output of the decoder, and it is shared across the tokens because it can be seen as an embedding matrix. – erre4 Mar 10 '21 at 21:14
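
To tie the last few comments together, here is a small sketch (again assuming PyTorch, with made-up sizes) of the two pieces being discussed: the position-wise feed-forward block inside each layer, and the single linear + softmax output layer that is shared across all positions:

```python
import torch
import torch.nn as nn

d_model, d_ff, vocab_size, seq_len = 512, 2048, 5, 5

# position-wise feed-forward block inside each decoder layer:
# two linear layers with a ReLU in between, applied independently to every position
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),    # project up to the larger dimensionality
    nn.ReLU(),
    nn.Linear(d_ff, d_model),    # project back down to d_model
)

# final output layer: a single (d_model x vocab_size) matrix shared by all positions
to_vocab = nn.Linear(d_model, vocab_size)

decoder_output = torch.randn(1, seq_len, d_model)   # pretend this came out of the decoder

ff_out = feed_forward(decoder_output)    # same FFN weights applied to all seq_len positions at once
logits = to_vocab(decoder_output)        # (1, seq_len, vocab_size); no 5*5*5 weight tensor
probs  = torch.softmax(logits, dim=-1)   # one softmax per position, same shared weights
```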