I am looking at the MegatronLM implementation, and the only things that are cached are the results of the xK and xV computations.
These are then stacked with the past values, and the full QK matrix is still computed. The computation of keys and values does not seem that expensive in comparison.
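
To make the caching pattern concrete, here is a minimal pure-Python sketch of single-head attention with a key/value cache. It is not MegatronLM's code: the names `KVCache`, `decode_step`, `attend` and `full_causal_attention`, the toy weights, and the one-token-per-step decoding loop are all assumptions made for illustration. It records how many attention scores each step computes, so the cost of the QK product under this assumed decoding scheme can be checked directly against a no-cache reference.

```python
import math

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention for ONE query against all given keys/values.
    Returns the output vector and the raw scores (one per key)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, scores

class KVCache:
    """Hypothetical cache: stores the xK and xV results of past tokens."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(x_new, Wq, Wk, Wv, cache):
    """Process one new token: project it, stack its K and V onto the
    cached ones, and attend with the new token's query only."""
    q = matvec(x_new, Wq)
    cache.append(matvec(x_new, Wk), matvec(x_new, Wv))
    return attend(q, cache.keys, cache.values)

def full_causal_attention(xs, Wq, Wk, Wv):
    """No-cache reference: recompute Q, K, V for every position."""
    qs = [matvec(x, Wq) for x in xs]
    ks = [matvec(x, Wk) for x in xs]
    vs = [matvec(x, Wv) for x in xs]
    return [attend(qs[t], ks[: t + 1], vs[: t + 1])[0] for t in range(len(xs))]

# Toy weights and inputs (made up for this sketch).
Wq = [[0.5, -0.2], [0.1, 0.3]]
Wk = [[0.2, 0.4], [-0.3, 0.1]]
Wv = [[1.0, 0.0], [0.5, -1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs, score_lengths = [], []
for x in xs:
    out, scores = decode_step(x, Wq, Wk, Wv, cache)
    cached_outputs.append(out)
    score_lengths.append(len(scores))

reference = full_causal_attention(xs, Wq, Wk, Wv)
```

In this sketch each step computes one score per cached position (`score_lengths` grows by one per token), and the cached outputs match the no-cache reference. Whether the code under discussion feeds a single new token per step in the same way is something to verify against the implementation itself.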