As you described in your comment, this technique regularizes the model by reusing the same weights for the logit computation. I think, however, that there is an earlier reference to this technique, prior to the article provided in the linked answer: specifically, the article Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling (published at ICLR'17).
This embedding matrix, both at the input and at the output, is just a table of token embeddings:
- At the input, it is used by the embedding layer as a lookup table that translates token indices into token vectors.
- At the output, we multiply the model output vector by the (transposed) matrix; this is equivalent to computing the dot product of the model output with each of the vectors in the embedding matrix. The result of this multiplication is a vector with the "similarity" between the model output and each of the token vectors. These similarities can then be normalized into probabilities by means of the softmax function (see the sketch after this list).
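To make the two uses concrete, here is a minimal PyTorch sketch, assuming a standard decoder setup; the class and method names (`TiedLMHead`, `embed`, `logits`) are made up for this illustration and are not any library's API:

```python
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Illustrative module: one matrix E of shape (vocab_size, d_model) is used
    both as the input lookup table and as the output projection."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)  # holds E

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Input side: look up rows of E -> (batch, seq_len, d_model)
        return self.embedding(token_ids)

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output side: multiply by E^T -> (batch, seq_len, vocab_size).
        # Each logit is the dot product of the hidden state with one token embedding.
        return hidden @ self.embedding.weight.t()


tied = TiedLMHead(vocab_size=50_000, d_model=512)
tokens = torch.randint(0, 50_000, (1, 4))           # (batch=1, seq_len=4)
hidden = tied.embed(tokens)                          # (1, 4, 512); stand-in for the transformer body's output
probs = torch.softmax(tied.logits(hidden), dim=-1)   # (1, 4, 50000): one probability per vocabulary token
```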
The point of transposing the matrix is simply to make its shape compatible with the model output for the matrix multiplication: the embedding matrix stores one row per token, so without transposing it we could not multiply the model output with it to obtain one value (the dot product with that token's vector) per token in the vocabulary.
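Writing $|V|$ for the vocabulary size and $d$ for the embedding dimensionality (symbols introduced here just for this illustration), the shapes are:

$$E \in \mathbb{R}^{|V| \times d}, \qquad h \in \mathbb{R}^{1 \times d}, \qquad \text{logits} = h\,E^{\top} \in \mathbb{R}^{1 \times |V|},$$

where the $i$-th logit is the dot product between $h$ and the $i$-th row of $E$, i.e. the embedding of token $i$.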
By sharing the embedding matrix, we are reducing the total number of parameters of the model. Given that the vocabulary size of modern LMs or encoder/decoder architectures is on the order of 30k-50k and the embedding dimensionality can be in the range of 512-2048, the number of parameters we are saving is not negligible. By reducing the number of parameters, we also reduce the possibility of overfitting, therefore regularizing the model.
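As a back-of-the-envelope illustration (the 50k vocabulary and 1024-dimensional embeddings below are just example values within the ranges above, not any specific model), tying removes an entire vocabulary-sized projection matrix:

```python
import torch.nn as nn

vocab_size, d_model = 50_000, 1_024  # example values

embedding = nn.Embedding(vocab_size, d_model)              # input embedding: 51.2M parameters
output_proj = nn.Linear(d_model, vocab_size, bias=False)   # untied output projection: another 51.2M

# Tying: reuse the embedding weights as the output projection,
# so the second 51.2M-parameter matrix disappears.
output_proj.weight = embedding.weight

untied = 2 * vocab_size * d_model   # 102,400,000 parameters
tied = vocab_size * d_model         #  51,200,000 parameters
print(f"saved: {untied - tied:,} parameters")  # saved: 51,200,000 parameters
```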
This technique is feasible because the set of input tokens is the same as the set of output tokens, so it makes sense to use the same vectors for both the input and the output token spaces. There have been other attempts at parameter sharing between different parts of the transformer model (e.g. this and this), but with a lesser degree of success, so they have not been widely adopted.