Near-duplicate detection algorithms for a near-real-time system

Asked Jun 19 '22 at 16:23

Active Jul 18 '22 at 11:20


I'm looking for near-duplicate detection algorithms or techniques for a near-real-time system that handles large document volumes. I know LSH is the most popular, industry-standard approach for syntactic use cases, but I'm trying to find better alternatives. I've shortlisted the following techniques to try, and I'd like insights from experienced data scientists who have run these or other techniques in production with high precision and recall. Any comparisons or benchmark reports would also be very helpful.

Multi-probe LSH
Spherical LSH
Earth Mover's Distance

PS: I know there are KNN/ANN solutions that offer semantic near-duplicate detection using machine learning. I'm specifically exploring near-duplicate detection techniques that do not require any training or building feature vectors for training. That said, I'm open to techniques that need only modest effort for training or vector building.
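For reference, a training-free LSH baseline of the kind mentioned above might look roughly like the sketch below. It assumes MinHash over character shingles with banding, which is one common LSH family for Jaccard similarity and not necessarily the variant in use here. The names `shingles`, `minhash_signature` and `LSHIndex`, and the parameters `k`, `num_perm`, `bands` and `threshold`, are illustrative and do not come from any library.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(shingle_set, num_perm=64):
    """MinHash signature: the minimum hash value under each seeded hash."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


class LSHIndex:
    """Hypothetical in-memory index: banded MinHash signatures in hash buckets."""

    def __init__(self, num_perm=64, bands=16):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        self.buckets = [defaultdict(set) for _ in range(bands)]
        self.signatures = {}

    def _band_keys(self, signature):
        for b in range(self.bands):
            yield b, tuple(signature[b * self.rows:(b + 1) * self.rows])

    def add(self, doc_id, text):
        signature = minhash_signature(shingles(text), self.num_perm)
        self.signatures[doc_id] = signature
        for b, key in self._band_keys(signature):
            self.buckets[b][key].add(doc_id)

    def query(self, text, threshold=0.7):
        """Return (doc_id, estimated Jaccard) pairs at or above the threshold."""
        signature = minhash_signature(shingles(text), self.num_perm)
        # Candidates are documents sharing at least one identical band.
        candidates = set()
        for b, key in self._band_keys(signature):
            candidates |= self.buckets[b].get(key, set())
        results = []
        for doc_id in candidates:
            other = self.signatures[doc_id]
            estimate = sum(x == y for x, y in zip(signature, other)) / self.num_perm
            if estimate >= threshold:
                results.append((doc_id, estimate))
        return sorted(results, key=lambda r: -r[1])
```

Each incoming document would be passed to `query` before `add`, so a lookup touches only the buckets its bands hash to. The split of `num_perm` into `bands` sets the recall/precision trade-off and would need tuning against real data.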

edited Jun 24 '22 at 10:17


Murali Mopuru

Two "nears" in the question – Carlos Mougan Jul 19 '22 at 11:42


0 Answers