Is attention cache useful during transformer pretraining?

Asked Sep 07 '22 at 17:06

Active Jan 05 '23 at 21:13

Viewed 187 times

I am looking at the Megatron-LM implementation, and the only things that are cached are the results of the xK and xV computations (the key and value projections):

https://github.com/NVIDIA/Megatron-LM/blob/b44dca25727c294a7f825e74a3c4a53744cc8404/megatron/model/transformer.py#L339

These are then concatenated with the past values, and the full QK matrix is still computed. Computing the keys and values does not seem that expensive in comparison.
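To make the question concrete, here is a minimal NumPy sketch of what I understand a key/value cache to do. It is not Megatron-LM's code: the single-head setup and the names `cached_step` and `full_causal_attention` are my own assumptions for illustration. In this sketch, each cached step projects only the new token and computes a single 1 x t row of scores against the cached keys, whereas the uncached version recomputes the projections and the full t x t score matrix for the whole sequence.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, Wq, Wk, Wv):
    """No cache: project all t tokens and build the full (t, t) score matrix."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def cached_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
    """With cache: project only the new token; scores have shape (1, t)."""
    q = x_new @ Wq
    k_cache = np.concatenate([k_cache, x_new @ Wk], axis=0)
    v_cache = np.concatenate([v_cache, x_new @ Wv], axis=0)
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_cache, k_cache, v_cache

rng = np.random.default_rng(0)
d, t = 8, 5  # hypothetical model width and sequence length
x = rng.normal(size=(t, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Feed the tokens one at a time, growing the cache as we go.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
outputs = []
for i in range(t):
    out, k_cache, v_cache = cached_step(x[i:i + 1], Wq, Wk, Wv, k_cache, v_cache)
    outputs.append(out)
incremental = np.concatenate(outputs, axis=0)

# The token-by-token result should agree with the all-at-once computation.
reference = full_causal_attention(x, Wq, Wk, Wv)
matches = np.allclose(incremental, reference)
```

In this toy setting the cache only pays off when tokens arrive one at a time. When the whole sequence is available up front, `full_causal_attention` produces every position in a single pass and there is nothing to reuse.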

edited Jan 05 '23 at 21:13

Franck Dernoncourt


asked Sep 07 '22 at 17:06

LOST


0 Answers