12

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here:

https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/

could yield three clusters ({A}, {B,C,D} and {E}), for instance, if its results were:

A -> h01
B -> h02
C -> h02
D -> h02
E -> h03

Similarly, the MinHashing technique described in the Chapter 3 of the MMDS book:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

could also yield the same three clusters if its results were:

A -> h01 - h02 - h03

B -> h04 - h05 - h06
      |
C -> h04 - h07 - h08
                  |
D -> h09 - h10 - h08

E -> h11 - h12 - h13

(Each set corresponds to a MH signature composed of three "bands", and two sets are grouped if at least one of their signature bands is matching. More bands would mean more matching chances.)

However I have several questions related to these:

(1) Can SH be understood as a single band version of MH?

(2) Does MH necessarily imply the use of a data structure like Union-Find to build the clusters?

(3) Am I right in thinking that the clusters, in both techniques, are actually "pre-clusters", in the sense that they are just sets of "candidate pairs"?

(4) If (3) is true, does it imply that I still have to do an $O(n^2)$ search inside each "pre-cluster", to partition them further into "real" clusters? (which might be reasonable if I have a lot of small and fairly balanced pre-clusters, not so much otherwise)

cjauvin
  • 451
  • 3
  • 7

1 Answers1

3

As correctly pointed out above MinHash and SimHash both belong to Locality Sensitive Hashing. Reference: https://en.wikipedia.org/wiki/Locality-sensitive_hashing

The main difference between the two is the way collision is handled,

  1. SimHash, uses cosine similarity
  2. MinHash, uses Jaccard Index.

Answers to your questions:

  1. No. They uses different collision handling techniques to validate similarity. Also there is a variant on single Hash Function for Min Hash but it works differently. For more details, check the following reference out, https://en.wikipedia.org/wiki/MinHash (Variant with a single hash function)
  2. Yes, https://github.com/chrisjmccormick/MinHash/blob/master/runMinHashExample.py
  3. I think the complexity can be reduced to $O(n \log n)$ with modified form of binary search while clustering.
Alexey Grigorev
  • 2,850
  • 1
  • 12
  • 18
Pramit
  • 298
  • 1
  • 9
  • SimHash and MinHash do not use these similarity functions. I think a better way to say it would be that they create digests which approximate these functions. – Alexey Grigorev Aug 05 '15 at 09:15
  • @AlexeyGrigorev I am a little confused. I looked into the following implementation for minHash 'computeSimilarityFromSignatures' @ [link](https://mymagnadata.wordpress.com/2011/01/04/minhash-java-implementation/). It uses a |HashedArray(A) & HashedArray(B)|/ (total number of entries) – Pramit Aug 05 '15 at 19:15