I'm looking for near-duplicate detection algorithms or techniques suitable for a near-real-time system that handles large document volumes. I know locality-sensitive hashing (LSH) is the most popular industry-standard approach for syntactic use cases, but I'm trying to find better alternatives to it. I've shortlisted the following algorithms to try, and I would like insights from experienced data scientists who have used these or other techniques in production and achieved high precision and recall. Any comparisons or benchmark reports would be much appreciated.
- Multi-probe LSH
- Spherical LSH
- Earth Mover's Distance
PS: I know there are KNN/ANN solutions that offer semantic near-duplicate detection using machine learning. I'm specifically exploring techniques that do not require training a model or building feature vectors for training. That said, I'm open to techniques that need only modest effort for training or vector construction.
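
For context, here is a minimal sketch of the kind of training-free, syntactic baseline I mean: word shingling plus MinHash signatures, which is the usual front end for LSH. The function names and parameters (`shingles`, `minhash_signature`, `k=3`, `num_hashes=128`) are my own illustrative choices and not taken from any particular library, and a production system would add LSH banding on top to avoid comparing every pair.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a document (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents become a single shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; the seed selects the hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [
        min(_seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for checking the estimate on small inputs."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
    sa, sb = shingles(doc_a), shingles(doc_b)
    print("exact:", exact_jaccard(sa, sb))
    print("estimate:", estimated_jaccard(minhash_signature(sa), minhash_signature(sb)))
```

Nothing here needs training data or a learned embedding, which is the property I would like any alternative to keep.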