As you described in your comment, this technique regularizes the model by reusing the same weights for the logit computation. I think, however, that there is an earlier reference to this technique, prior to the article provided in the linked answer: specifically, the article Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling (published at ICLR'17).
This embedding matrix, both at the input and at the output, is just a table of token embeddings:
- At the input, it is used by the embedding layer as a lookup table that translates token indices into token vectors.
- At the output, we multiply the model output vector by the (transposed) matrix; this is equivalent to computing the dot product of the model output with each of the vectors in the embedding matrix. The result of this multiplication is a vector with the "similarity" between the model output and each of the token vectors. These similarities can then be normalized into probabilities by means of the softmax function (see the sketch after this list).
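To make the two uses concrete, here is a minimal PyTorch sketch, assuming a standard decoder setup; the class and method names (`TiedLMHead`, `embed`, `logits`) are made up for this illustration and are not any library's API:

```python
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Illustrative module: one matrix E of shape (vocab_size, d_model) is used
    both as the input lookup table and as the output projection."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)  # holds E

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Input side: look up rows of E -> (batch, seq_len, d_model)
        return self.embedding(token_ids)

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output side: multiply by E^T -> (batch, seq_len, vocab_size).
        # Each logit is the dot product of the hidden state with one token embedding.
        return hidden @ self.embedding.weight.t()


tied = TiedLMHead(vocab_size=50_000, d_model=512)
tokens = torch.randint(0, 50_000, (1, 4))           # (batch=1, seq_len=4)
hidden = tied.embed(tokens)                          # (1, 4, 512); stand-in for the transformer body's output
probs = torch.softmax(tied.logits(hidden), dim=-1)   # (1, 4, 50000): one probability per vocabulary token
```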
The point of transposing the matrix is simply to make its shape compatible with the model output for the matrix multiplication: the embedding matrix stores one row per token, so without transposing it we could not multiply the model output with it to obtain one value (the dot product with that token's vector) per token in the vocabulary.
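Writing $|V|$ for the vocabulary size and $d$ for the embedding dimensionality (symbols introduced here just for this illustration), the shapes are:

$$E \in \mathbb{R}^{|V| \times d}, \qquad h \in \mathbb{R}^{1 \times d}, \qquad \text{logits} = h\,E^{\top} \in \mathbb{R}^{1 \times |V|},$$

where the $i$-th logit is the dot product between $h$ and the $i$-th row of $E$, i.e. the embedding of token $i$.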
By sharing the embedding matrix, we are reducing the total number of parameters of the model. Given that the vocabulary size of modern LMs or encoder/decoder architectures is on the order of 30k-50k and the embedding dimensionality can be in the range of 512-2048, the number of parameters we are saving is not negligible. By reducing the number of parameters, we also reduce the possibility of overfitting, therefore regularizing the model.
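As a back-of-the-envelope illustration (the 50k vocabulary and 1024-dimensional embeddings below are just example values within the ranges above, not any specific model), tying removes an entire vocabulary-sized projection matrix:

```python
import torch.nn as nn

vocab_size, d_model = 50_000, 1_024  # example values

embedding = nn.Embedding(vocab_size, d_model)              # input embedding: 51.2M parameters
output_proj = nn.Linear(d_model, vocab_size, bias=False)   # untied output projection: another 51.2M

# Tying: reuse the embedding weights as the output projection,
# so the second 51.2M-parameter matrix disappears.
output_proj.weight = embedding.weight

untied = 2 * vocab_size * d_model   # 102,400,000 parameters
tied = vocab_size * d_model         #  51,200,000 parameters
print(f"saved: {untied - tied:,} parameters")  # saved: 51,200,000 parameters
```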
This technique is feasible because the set of input tokens is the same as the set of output tokens, so it makes sense to use the same vectors for both the input and the output token spaces. There have been other attempts at parameter sharing between different parts of the transformer model (e.g. this and this), but with a lesser degree of success, so they have not been widely adopted.