I am curious about the state of the art in using transformers for regression. In particular, I am interested in how researchers in this field aggregate the outputs of the final attention layer. Has anything been implemented that is more novel than simple pooling?
The best I have found is the attentive pooling used in a point-cloud neural network (RandLA-Net, https://arxiv.org/abs/1911.11236). There, latent vectors are aggregated by weighted pooling, where a small network outputs elementwise weights for each latent vector.
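To make the idea concrete, here is a minimal sketch of that kind of attentive pooling, assuming a single learned linear scoring layer (the names `attentive_pool` and `w` are hypothetical, and the real paper uses a shared MLP rather than one linear map):

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pool(latents, w):
    # latents: (n, d) set of latent vectors; w: (d, d) hypothetical score weights
    scores = latents @ w                    # (n, d) elementwise attention scores
    weights = softmax(scores, axis=0)       # normalize over the set dimension
    return (weights * latents).sum(axis=0)  # (d,) weighted sum -> pooled vector

rng = np.random.default_rng(0)
latents = rng.standard_normal((16, 8))
w = rng.standard_normal((8, 8))
pooled = attentive_pool(latents, w)
print(pooled.shape)  # (8,)
```

The key difference from plain mean pooling is that the softmax weights are separate per feature channel, so each channel of the pooled vector can attend to different elements of the set.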
I am hoping you can point me to some relevant research papers.