9

I want to use Latent Dirichlet Allocation (LDA) for a project, using Python with the gensim library. After finding the topics I would like to cluster the documents using an algorithm such as k-means (ideally I would like to use a good one for overlapping clusters, so any recommendation is welcome). I managed to get the topics, but they are in the form of:

0.041*Minister + 0.041*Key + 0.041*moments + 0.041*controversial + 0.041*Prime

In order to apply a clustering algorithm (and correct me if I'm wrong), I believe I should find a way to represent each word as a number, using either tf-idf or word2vec.

Do you have any ideas on how I could "strip" the textual information out of, e.g., a list in order to do so, and then put the words back in order to perform the appropriate multiplication?

For instance, the way I see it, if the word Minister has a tf-idf weight of 0.042, and so on for every other word within the same topic, I should be able to compute something like:

0.041*0.042 + ... + 0.041*tfidf(Prime), and get a result that will later be used to cluster the documents.

Thank you for your time.

Swan87
  • [As explained in the tutorial](http://radimrehurek.com/gensim/tut2.html), you can express documents as vectors. Cluster those vectors. – Emre Nov 13 '14 at 18:54
  • I know, mate, but I have to cluster them according to the topics created after I apply LDA to my collection. Each topic should be represented as a vector, in order to compare each document with each topic and find the corresponding topic or topics for each doc. – Swan87 Nov 14 '14 at 11:03
  • You don't have to represent each word as a vector. You get the new representation for the entire *document* by applying the LDA transformation you learned *to the corpus*. For an example with LSI, see this link: http://radimrehurek.com/gensim/tut2.html The key part is where they apply the learned LSI transformation to the entire corpus with lsi[doc_bow] – Will Stanton Jun 13 '15 at 02:29

3 Answers

5

Assuming that LDA produced a list of topics and put a score against each topic for each document, you can represent each document and its scores as a vector:

Document | Topic1 | Topic2 | Topic3 | ... | TopicN
   1       0.041    0.042    0.041    ...
   2       0.052    0.011    0.042    ...

To get the scores for each document, you can run the document, as a bag of words, through a trained LDA model. From the gensim documentation:

>>> from gensim.models import LdaModel
>>> lda = LdaModel(corpus, num_topics=100)  # train model
>>> print(lda[doc_bow])  # get topic probability distribution for a document
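
A minimal sketch of how you could collect those per-document distributions into a dense matrix (the function name is mine, and it assumes `corpus` is an in-memory list of bag-of-words documents and `lda` is the trained model from the snippet above):

    import numpy as np

    def doc_topic_matrix(lda, corpus):
        # One row per document, one column per topic; rows sum to ~1.
        matrix = np.zeros((len(corpus), lda.num_topics))
        for i, doc_bow in enumerate(corpus):
            # minimum_probability=0.0 makes gensim report every topic,
            # including the near-zero ones it would otherwise drop.
            for topic_id, prob in lda.get_document_topics(doc_bow, minimum_probability=0.0):
                matrix[i, topic_id] = prob
        return matrix

Note that plain `lda[doc_bow]` filters out topics whose probability falls below a small threshold, so some topics may appear to be missing from its output; asking for `minimum_probability=0.0` avoids that.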

Then you can run k-means on this matrix, and it should group similar documents together. Note that k-means is a hard clustering algorithm: it assigns each document to exactly one cluster. If you want a soft assignment instead, i.e. a probability that a document belongs to each cluster, you can use fuzzy k-means; https://gist.github.com/mblondel/1451300 is a Python gist showing how you can do it with scikit-learn.
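
For the hard-clustering case, a minimal sketch with scikit-learn (`doc_topic` is the dense matrix built above, and `n_clusters=5` is an arbitrary choice for illustration):

    from sklearn.cluster import KMeans

    doc_topic = doc_topic_matrix(lda, corpus)   # helper sketched above
    kmeans = KMeans(n_clusters=5, random_state=0)
    labels = kmeans.fit_predict(doc_topic)      # one hard cluster id per document

If you want soft memberships without leaving scikit-learn, `sklearn.mixture.GaussianMixture` has a `predict_proba` method that returns a probability per cluster for each document.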


Ash
  • I tried to do that for n documents, with t topics. However, for some documents not all t topic probabilities show up; only (t − k) of them do, for some 1 <= k < t. This does not happen when I run the experiment on a small set of documents. Is it because a topic is not printed at all if its probability is 0? – Manish Ranjan Mar 07 '16 at 17:40
0

Complementary to the previous answer: you had better not run k-means directly on the compositional data derived from the LDA topic-document distribution. Topic proportions sum to one, so the documents live on a simplex rather than in unconstrained Euclidean space. Instead, use a compositional-data transformation such as the isometric log-ratio (ilr) or centered log-ratio (clr) transform to project them into Euclidean space first.

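For instance, a minimal sketch of the clr transform (assuming `doc_topic` is the dense document-topic matrix from the first answer; the pseudocount is my addition to avoid taking the log of zero):

    import numpy as np

    def clr(matrix, pseudocount=1e-9):
        # Centered log-ratio transform, applied row-wise:
        # subtract each row's mean log from that row's logs.
        logs = np.log(matrix + pseudocount)
        return logs - logs.mean(axis=1, keepdims=True)

    # Then cluster in Euclidean space, e.g.:
    # labels = KMeans(n_clusters=5, random_state=0).fit_predict(clr(doc_topic))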

Stephen Rauch
0

Another approach would be to use the document-topic matrix obtained by training the LDA model, extract the topic with the maximum probability for each document, and let that topic be the document's label.

This gives a result that is interpretable to roughly the degree that your topics are.
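
A minimal sketch (again assuming `doc_topic` is the dense document-topic matrix from the first answer):

    import numpy as np

    # The most probable topic becomes the document's label.
    labels = np.argmax(doc_topic, axis=1)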