
I'm storing sentences in Elasticsearch as a dense_vector field and using BERT for the embeddings, so each vector is 768-dimensional. Elasticsearch offers similarity function options such as Euclidean, Manhattan, and cosine similarity. I have tried them, and both Manhattan and cosine give me very similar, good results, so now I don't know which one I should choose.
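
For concreteness, here is a minimal sketch of the kind of query this refers to, using the Python client. The index name `sentences`, the field name `sentence_vector`, and the random stand-in for a real BERT query embedding are placeholders; `cosineSimilarity` and `l1norm` are the vector functions Elasticsearch exposes for dense_vector fields inside script_score queries.

```python
# Sketch only: index/field names and the query vector are placeholders.
import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Stand-in for a real 768-dim BERT query embedding.
query_vector = np.random.rand(768).tolist()

body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # For Manhattan ranking, use e.g.
                # "1 / (1 + l1norm(params.query_vector, 'sentence_vector'))" instead.
                "source": "cosineSimilarity(params.query_vector, 'sentence_vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    }
}

response = es.search(index="sentences", body=body)
for hit in response["hits"]["hits"][:3]:
    print(hit["_score"], hit["_source"])
```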

  • The canonical answer is to go with cosine similarity for (very) high-dimensional word vectors. If you want to make sure, use cross-validation to test it for your case: https://scikit-learn.org/stable/modules/cross_validation.html – Make42 Jul 05 '21 at 15:53

1 Answer


Intuitively, if you normalized the vectors before using them, or if they all ended up having almost unit norm after training, then a small $l_1$ (Manhattan) distance between two vectors implies that the angle between them is small, hence the cosine similarity will be high. Conversely, almost collinear vectors will have almost equal coordinates, because they all have unit length. So if one works well, the other will work well too.
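
As a quick illustration of that intuition, here is a numpy sketch. Random unit vectors stand in for normalized sentence embeddings, and a handful of slightly perturbed copies of the query play the role of semantically close sentences; both measures pick out the same nearest neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(a):
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

# A query vector plus a candidate pool: 5 slightly perturbed copies of the query
# (stand-ins for semantically close sentences) and 995 unrelated random vectors.
q = normalize(rng.normal(size=768))
close = normalize(q + 0.1 * normalize(rng.normal(size=(5, 768))))
far = normalize(rng.normal(size=(995, 768)))
pool = np.vstack([close, far])              # rows 0..4 are the "close" candidates

cos_sim = pool @ q                          # cosine similarity (all unit vectors)
l1_dist = np.abs(pool - q).sum(axis=1)      # Manhattan / l1 distance

top_by_cos = set(np.argsort(-cos_sim)[:5])  # 5 highest cosine similarities
top_by_l1 = set(np.argsort(l1_dist)[:5])    # 5 smallest Manhattan distances
print(top_by_cos == top_by_l1 == {0, 1, 2, 3, 4})  # True: same neighbours either way
```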

To see why, recall the equivalence of the $l_1$ and $l_2$ norms in $\mathbb{R}^n$, in particular that $||x||_2 \le ||x||_1$ for any $x \in \mathbb{R}^n$. We can use that to prove the first of the two statements (the other is left as an exercise ;)

If $||u||_2 = ||v||_2 = 1$ and $||u-v||_1 \le \sqrt{2\epsilon}$, then $\langle u, v \rangle \ge 1 - \epsilon$.

To prove this, expand $||u-v||_2^2 = ||u||_2^2 + ||v||_2^2 - 2\langle u, v \rangle = 2 - 2 \langle u, v \rangle$ (using the unit norms) to obtain:

$$\langle u, v \rangle = 1 - \frac{1}{2} ||u-v||_2^2 \ge 1- \frac{1}{2} ||u-v||_1^2 \ge 1 - \epsilon.$$
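
A quick numerical sanity check of this chain (a numpy sketch with random unit vectors, where $\epsilon$ is taken to be $\frac{1}{2}||u-v||_1^2$ so that the hypothesis $||u-v||_1 \le \sqrt{2\epsilon}$ holds by construction):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random pairs of unit vectors in R^768.
U = rng.normal(size=(10_000, 768))
V = rng.normal(size=(10_000, 768))
U /= np.linalg.norm(U, axis=1, keepdims=True)
V /= np.linalg.norm(V, axis=1, keepdims=True)

dot = np.einsum("ij,ij->i", U, V)       # <u, v> for each pair
l2 = np.linalg.norm(U - V, axis=1)      # ||u - v||_2
l1 = np.abs(U - V).sum(axis=1)          # ||u - v||_1

print(np.allclose(dot, 1 - 0.5 * l2**2))      # True: <u, v> = 1 - ||u - v||_2^2 / 2
print(bool(np.all(dot >= 1 - 0.5 * l1**2)))   # True: the bound with eps = ||u - v||_1^2 / 2
```

Note that in high dimensions the bound is quite loose, since $||x||_1$ can be as large as $\sqrt{n}\,||x||_2$; it becomes informative precisely when the $l_1$ distance is small, which is the regime the intuition above is about.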

So in the end, which one you choose is up to you. One reason to prefer the cosine is the differentiability of the scalar product, which is all you need if you assume normalized vectors.

Miguel
  • Actually I don't modify the vectors coming from the embedding, and they don't add up to 1, so I don't know whether they are normalized or not! – Mohy Mohamed Jun 27 '21 at 09:45
  • Note that the normalization above is in the $l_2$ norm, so it's the sum of the squared components that would add to 1. Also, they don't need to be perfectly normalized for the above argument to work. As long as the $l_2$ norms are uniformly lower bounded by some number $1-\delta$ for small $\delta$, you can subsume the $\delta$ in the upper bound for $||u-v||_1$. – Miguel Jun 27 '21 at 12:35
  • For many kinds of models one must normalize inputs. Since BERT features are used in many downstream tasks, it is possible that the model you used is already regularized in a way that provides some "soft guarantees" of this, but I'm just speculating. – Miguel Jun 27 '21 at 12:41
  • So you mean I need to make sure that the vectors coming from the embedding are normalized, and normalize them if they are not? And anyway, I still don't get which one I should use between the query vector and the document vectors: cosine similarity or Manhattan? – Mohy Mohamed Jun 28 '21 at 09:01
  • Sorry, I guess I wasn't clear enough. The point I was trying to make is that, *if* your vectors are normalized, *then* it is to be expected that both cosine and Manhattan / $l_1$ will provide almost equal results and differentiability might be one reason to prefer one over the other, but not performance. In the comments I mentioned that it is standard practice to normalize inputs to models. And because this is so, it could be that whatever you used to train your model already (almost) normalized its outputs for you. But there I'm just guessing. – Miguel Jun 28 '21 at 10:20
  • If you want a rule of thumb, then prefer cosine over Manhattan because it is not sensitive to normalization: angles between vectors do not depend on their magnitudes, and language embeddings don't (typically?) rely on the lengths of vectors. A quick check of both points is sketched below. – Miguel Jun 28 '21 at 10:22
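
Picking up the two points from the comments, here is a small numpy sketch that checks whether a set of vectors is (approximately) $l_2$-normalized and shows that cosine similarity ignores a rescaling of a vector while the Manhattan distance does not. The `vectors` array is only a random stand-in for actual 768-dimensional BERT sentence embeddings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for your actual 768-dim BERT sentence embeddings.
vectors = rng.normal(size=(100, 768))

# 1) Are the vectors (approximately) l2-normalized?
#    "Normalized" means sqrt(sum of squared components) == 1,
#    not that the components themselves add up to 1.
norms = np.linalg.norm(vectors, axis=1)
print("l2 norms: min", norms.min(), "max", norms.max())

# 2) Cosine is insensitive to vector length, Manhattan is not.
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def manhattan(u, v):
    return np.abs(u - v).sum()

u, v = vectors[0], vectors[1]
print(np.isclose(cosine(u, v), cosine(3 * u, v)))        # True: cosine ignores the scaling
print(np.isclose(manhattan(u, v), manhattan(3 * u, v)))  # False: Manhattan does not
```

If the norms of your real embeddings come out close to 1, the argument in the answer applies directly; if not, you can always normalize them yourself before indexing.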