I am looking at the Megatron-LM implementation, and the only things that are cached are the results of the xK and xV computations:

https://github.com/NVIDIA/Megatron-LM/blob/b44dca25727c294a7f825e74a3c4a53744cc8404/megatron/model/transformer.py#L339

These are then stacked with the past values, and the full QK matrix is still computed. The computation of keys and queries does not seem that expensive in comparison.
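For concreteness, here is a minimal NumPy sketch of the caching pattern described above, assuming a single attention head and no batching. The function `attend_with_cache` and the weight names `W_q`, `W_k`, `W_v` are hypothetical and are not taken from Megatron-LM; the sketch only illustrates which quantities are stored and which are recomputed at each step.

```python
import numpy as np

def attend_with_cache(x_new, W_q, W_k, W_v, cache=None):
    """Causal self-attention for the newest tokens, reusing cached keys/values.

    x_new: (n_new, d_model) hidden states of the new tokens only.
    cache: optional tuple (K_past, V_past) from earlier steps.
    Returns the attention output for the new tokens and the updated cache.
    """
    # Projections are computed for the new tokens only.
    q = x_new @ W_q
    k = x_new @ W_k
    v = x_new @ W_v

    # Stack the new keys/values onto the cached ones.
    if cache is not None:
        k = np.concatenate([cache[0], k], axis=0)
        v = np.concatenate([cache[1], v], axis=0)

    # Scores between the new queries and all stacked keys: (n_new, t_total).
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Causal mask: new token i sits at absolute position t_past + i.
    n_new, t_total = scores.shape
    t_past = t_total - n_new
    future = np.arange(t_total)[None, :] > (t_past + np.arange(n_new))[:, None]
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ v, (k, v)
```

In this sketch, calling the function one token at a time with the returned cache gives the same outputs as a single causal pass over the whole sequence, so the cache only saves the key and value projections of past tokens.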

Franck Dernoncourt
LOST
