The MinHash algorithm is used to compute the similarity of two sets. The value calculated by MinHash is near to the Jaccard similarity coefficient. The Minhash steps are:
- Let f map all shingles(k-grams) of the universe to 1...2^m
- apply a random permutation on 1...2^m
- pick the min index of permuted items of each set
- repeat steps 2 and 3, n times.
- compute number of times the min value of min index of sets are equal / n (tihs is near to the Jaccard similarity coefficient).
Why the algorithm does a random permutation on hash values and the select the minHash? why does not directly select a random hash from each set and compare them, n times?