
I have close to 50000 documents in plain text format.

Is there a way in which I can group similar documents together? Similarity here mostly means content similarity.

Will transforming the text into vectors (using TF-IDF) and running a K-Means (unsupervised learning) algorithm on top of that help? Are there any better approaches that could be used?
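The TF-IDF + K-Means idea from the question can be sketched as follows; the toy documents and the choice of 2 clusters are assumptions for illustration, not part of the original question.

```python
# Hypothetical sketch: cluster plain-text documents with TF-IDF + K-Means.
# `docs` stands in for the ~50,000 real documents; n_clusters is a guess.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse TF-IDF document-term matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # one cluster label per document
print(labels)
```

In practice you would pick the number of clusters with a heuristic such as the silhouette score rather than fixing it up front.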

praneeth

2 Answers


I did something similar a while ago. We wanted to classify several types of PDFs.

  • We first extracted the text from the documents.
  • We created NLP features from the text.
  • We then added PDF metadata: file size, number of pages, document name...
  • We built a classification model with a few labelled samples and used Active Learning.

You could also use unsupervised learning, but I prefer supervised learning whenever labels are available.
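The pipeline described above might be sketched like this; the column names, toy data, and labels are all made up for illustration, and the Active Learning loop is omitted.

```python
# Hypothetical sketch: TF-IDF features from extracted text combined with
# numeric PDF metadata (file size, page count), fed to a supervised model.
# All data, column names, and labels here are invented for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "text": ["annual report revenue and profit", "invoice total amount due",
             "quarterly report on revenue", "invoice with payment due date"],
    "size_kb": [1200, 45, 900, 50],   # PDF metadata: file size
    "n_pages": [40, 1, 35, 2],        # PDF metadata: page count
    "label": ["report", "invoice", "report", "invoice"],
})

# Text goes through TF-IDF; numeric metadata is scaled and appended.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),
    ("meta", StandardScaler(), ["size_kb", "n_pages"]),
])

model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(df[["text", "size_kb", "n_pages"]], df["label"])

preds = model.predict(df[["text", "size_kb", "n_pages"]])
print(preds)
```

In an Active Learning setup you would train on the few labelled samples, have the model flag the documents it is least confident about, label those by hand, and retrain.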

Carlos Mougan

A common approach for this is LDA (Latent Dirichlet Allocation), which not only gives you the groups, but also a way to identify what each group is about, by surfacing the most common or distinctive words for each topic.

Schnipp