
I have close to 50000 documents in plain text format.

Is there a way in which I can group similar documents together? Similarity here mostly means content similarity.

Will transforming the text into vectors (using TF-IDF) and running a K-Means (unsupervised learning) algorithm on top of that help? Are there any better approaches that could be used?
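The TF-IDF + K-Means idea from the question can be sketched as follows; the toy documents and the choice of 2 clusters are assumptions for illustration, not part of the original question.

```python
# Hypothetical sketch: cluster plain-text documents with TF-IDF + K-Means.
# `docs` stands in for the ~50,000 real documents; n_clusters is a guess.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse TF-IDF document-term matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # one cluster label per document
print(labels)
```

In practice you would pick the number of clusters with a heuristic such as the silhouette score rather than fixing it up front.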

praneeth

2 Answers


I did something similar a while ago. We wanted to classify several types of PDFs.

  • We first extracted the text from the documents.
  • We created NLP features from the text.
  • We then added PDF metadata: file size, number of pages, document name...
  • We built a classification model with a few labelled samples and used Active Learning.

You could also use unsupervised learning, but I prefer supervised learning whenever labels are available.
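The pipeline described above might be sketched like this; the column names, toy data, and labels are all made up for illustration, and the Active Learning loop is omitted.

```python
# Hypothetical sketch: TF-IDF features from extracted text combined with
# numeric PDF metadata (file size, page count), fed to a supervised model.
# All data, column names, and labels here are invented for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "text": ["annual report revenue and profit", "invoice total amount due",
             "quarterly report on revenue", "invoice with payment due date"],
    "size_kb": [1200, 45, 900, 50],   # PDF metadata: file size
    "n_pages": [40, 1, 35, 2],        # PDF metadata: page count
    "label": ["report", "invoice", "report", "invoice"],
})

# Text goes through TF-IDF; numeric metadata is scaled and appended.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),
    ("meta", StandardScaler(), ["size_kb", "n_pages"]),
])

model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(df[["text", "size_kb", "n_pages"]], df["label"])

preds = model.predict(df[["text", "size_kb", "n_pages"]])
print(preds)
```

In an Active Learning setup you would train on the few labelled samples, have the model flag the documents it is least confident about, label those by hand, and retrain.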

Carlos Mougan

A common approach for this is LDA (Latent Dirichlet Allocation), which not only gives you the groups, but also a way to identify what each group is about, by surfacing the most common or distinctive words for each topic.

Schnipp