There are many simple plagiarism detection algorithms that work on top of search engines such as Google. I want an index of a corpus of the whole internet to serve as the back-end database for my plagiarism detection software. What would be the approach to building such a database? Are there any open-source or collaboratively maintained live repositories?

I read somewhere that instead of keeping a local database of the entire internet, one can index it and use the index for faster search.

I know Elasticsearch seems usable for this. Has anyone tried it before?

Shiva

1 Answer

I want to have a local database of corpus of the whole internet

Are you Google? If not, storage might be an issue ;)

The PAN series has run various tasks related to plagiarism detection in the past: https://pan.webis.de/tasks.html#task-originality. I think they provide annotated datasets, and they used to provide a live search engine.

Erwan
  • Ha ha, I don't want to be Google. I just thought a local copy would be faster. The link above doesn't seem very useful. I'm looking at something like Elasticsearch. – Shiva Jul 03 '19 at 18:25
  • Sure, local storage is faster, but do you have enough hard drives for the whole internet? "To store this amount of data you would need 700 million 4TB hard drives" https://www.live-counter.com/how-big-is-the-internet/ – Erwan Jul 03 '19 at 18:54
  • The link I gave you is an international benchmark used by researchers in the field of plagiarism detection. As far as I know, this is the reference for this task. – Erwan Jul 03 '19 at 18:58
  • I read somewhere that instead of having a local database of the entire internet, one can index it and use that for faster search. Is that so? – Shiva Jul 14 '19 at 17:15
  • It's correct that it's faster to have a local database. But to index the whole internet one needs around a billion hard drives, like Google. Elasticsearch by itself is only a search engine: it searches through whatever local data you have. That's why the PAN competition evaluated plagiarism systems not only on their accuracy but also on how many searches they require. A whole lot of research has been done in this area; I'd suggest you take a look at it. – Erwan Jul 14 '19 at 17:48
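
To make the "an index only searches whatever local data you have" point concrete, here is a minimal sketch of a local inverted index for overlap detection. It is not from the thread and is not how Elasticsearch or the PAN systems work internally; the `ShingleIndex` class, the `shingles` helper, and the choice of word 3-grams are all assumptions made for illustration.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of lowercase word n-grams (shingles) in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


class ShingleIndex:
    """Toy in-memory inverted index: shingle -> ids of documents containing it.

    Hypothetical helper for illustration only; it can only find overlap with
    documents that were added to it locally.
    """

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one document under `doc_id`."""
        for gram in shingles(text, self.n):
            self.postings[gram].add(doc_id)

    def query(self, text):
        """Return (doc_id, fraction of query shingles found in that doc), best first."""
        grams = shingles(text, self.n)
        if not grams:
            return []
        hits = defaultdict(int)
        for gram in grams:
            # .get avoids creating empty postings for unseen shingles
            for doc_id in self.postings.get(gram, ()):
                hits[doc_id] += 1
        scored = ((doc_id, count / len(grams)) for doc_id, count in hits.items())
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


index = ShingleIndex()
index.add("a", "the quick brown fox jumps over the lazy dog")
index.add("b", "an entirely unrelated sentence about search engines")
matches = index.query("quick brown fox jumps over a fence")
print(matches)
```

A lookup touches only the postings for the query's own shingles, which is why an index is faster than scanning every stored document. The storage problem raised above stays the same, though: anything not added to the index can never be matched.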