Why minhash algorithm use random permutation not random selection?

Question

The MinHash algorithm is used to compute the similarity of two sets. The value calculated by MinHash is near to the Jaccard similarity coefficient. The Minhash steps are:

Let f map all shingles(k-grams) of the universe to 1...2^m
apply a random permutation on 1...2^m
pick the min index of permuted items of each set
repeat steps 2 and 3, n times.
compute number of times the min value of min index of sets are equal / n (tihs is near to the Jaccard similarity coefficient).

Why the algorithm does a random permutation on hash values and the select the minHash? why does not directly select a random hash from each set and compare them, n times?

score 0 · Answer 1 · answered May 27 '23 at 03:27

If I understand your question correctly, then it is because of the order-preserving property of the function. Permutation is for measuring the importance of each element in the set.

In overly simplest terms, Imagine a sentence of words. Rather than directly tokenizing it, you create permutations of words so that you can measure how each word contributes to the uniqueness of the hash value generated. In other words, how informative and representative each word is of the whole set/sentence.

Why minhash algorithm use random permutation not random selection?

1 Answers1