
I'm working through "Attention Is All You Need", and I have a question about masking in the decoder. The paper states that masking is used to ensure the model doesn't attend to any tokens in the future (i.e., tokens not yet predicted), so that it can be used autoregressively during inference.

I don't understand how masking is used during inference. When the encoder is given an unseen sample with no ground truth output or prediction, it seems to me that there is nothing to mask, since there aren't any output tokens beyond what the decoder has already produced. Is my understanding of masking correct?

Thanks!

David Rein

1 Answer


The trick is that you do not need masking at inference time. The purpose of masking is to prevent the decoder states from attending to positions that correspond to tokens "in the future", i.e., tokens that will not be known at inference time because they will not have been generated yet.
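
For intuition, here is a minimal sketch of such a causal mask (assuming PyTorch; this is illustrative, not the paper's actual implementation):

```python
import torch

# Hypothetical attention scores for a sequence of length 4.
seq_len = 4
scores = torch.randn(seq_len, seq_len)

# Causal (lower-triangular) mask: position i may only attend to positions <= i.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Masked-out (future) positions get -inf, so softmax assigns them zero weight.
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```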

At inference time this is no longer a problem: there are no tokens from the future, because they have not been generated yet.
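
Concretely, greedy step-by-step decoding might look like the following sketch, where `decoder_step`, `START_ID`, `END_ID`, and `max_len` are hypothetical placeholders for your model's decoder call and special token IDs:

```python
# The decoder is run over only the tokens generated so far,
# so there are no future positions to mask out.
generated = [START_ID]                      # start-of-sequence token
for _ in range(max_len):
    logits = decoder_step(generated)        # self-attention sees only `generated`
    next_id = int(logits.argmax())
    generated.append(next_id)
    if next_id == END_ID:
        break
```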

Jindřich
  • This is wrong, see: https://ai.stackexchange.com/questions/16516/is-the-mask-needed-for-masked-self-attention-during-inference-with-gpt-2 You still need to apply masks in the decoder for all tokens except the final one, just like during training. – Simon Boehm Jul 17 '22 at 19:59
  • It depends on how you use the decoder. If you start generating from an empty string (as in MT), you only need to remember the past states; whenever you do the self-attention, you already have all the previous states and there is nothing to mask out. When you generate with a GPT-like model starting from a prompt, you of course need the masking for the prompt part. Also, if you do not cache the previous hidden states and instead recompute them at every generation step, you need the masking (which is what the answer you refer to assumes). – Jindřich Jul 18 '22 at 09:47
  • @SimonBoehm Are you sure? :) Why do you need to "mask" even if it's your own prediction, not a given answer (ground truth)? And your link is not about the original Transformer, which the OP is talking about. – starriet Apr 27 '23 at 07:03