
I have two sets of newspaper articles. I train a topic model on the first newspaper dataset on its own to obtain the topic distribution of each article.

E.g., first newspaper dataset
article_1 = {'politics': 0.1, 'nature': 0.8, ..., 'sports':0, 'wild-life':1}

Similarly, I train a separate topic model on the second newspaper dataset (from a different distributor) to obtain the topic distribution of each article.

E.g., second newspaper dataset (from a different distributor)
article_2 = {'people': 0.3, 'animals': 0.7, ...., 'business':0.7, 'sports':0.2}

As shown in the examples, the topics I get from the two datasets are different, so I manually matched similar topics based on their most frequent words.

I want to identify whether the two newspaper distributors publish the same news each week.

Hence, I am interested in knowing whether there is a systematic way to compare the topics across two corpora and measure their similarity. Please help me.
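
To make the setup concrete, here is a minimal sketch of how the manually matched topics could be turned into comparable vectors. The numbers repeat the examples above (with the ellipses dropped), and `topic_map` is a purely hypothetical outcome of the manual matching; any measure discussed in the answers below (cosine similarity, KL divergence, ...) can then operate on the resulting aligned vectors, after renormalizing them if a probability distribution is required.

```python
# Topic outputs from the two separately trained models (ellipses from the question dropped).
article_1 = {'politics': 0.1, 'nature': 0.8, 'sports': 0.0, 'wild-life': 1.0}
article_2 = {'people': 0.3, 'animals': 0.7, 'business': 0.7, 'sports': 0.2}

# Hypothetical result of the manual matching: second-corpus topic -> first-corpus topic.
topic_map = {'people': 'politics', 'animals': 'nature', 'sports': 'sports'}

# Shared topic list: first-corpus topics plus any unmatched second-corpus topics.
shared = sorted(set(article_1) | {topic_map.get(t, t) for t in article_2})

vec_1 = [article_1.get(t, 0.0) for t in shared]
vec_2 = [sum(w for t, w in article_2.items() if topic_map.get(t, t) == topic)
         for topic in shared]

print(shared)  # ['business', 'nature', 'politics', 'sports', 'wild-life']
print(vec_1)   # [0.0, 0.8, 0.1, 0.0, 1.0]
print(vec_2)   # [0.7, 0.7, 0.3, 0.2, 0.0]
```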

Smith
  • Interesting question. What is the technique you used for topic modeling? – Volka Oct 13 '17 at 01:07
  • I am actually using my own algorithm for it. I hope the choice of my algorithm does not affect this question :) – Smith Oct 13 '17 at 01:12
  • I would find a pooled topic model, then compare the individual distributions (e.g., by KLD). – Emre Oct 16 '17 at 01:07
  • @Emre Thanks a lot. What did you mean by a pooled topic model? Moreover, what is KLD? :) – Smith Oct 16 '17 at 01:30
  • I mean find the topics assuming all the articles come from a common source. KLD is https://en.wikipedia.org/wiki/Kullback–Leibler_divergence – Emre Oct 16 '17 at 01:44
  • @Emre Thanks a lot. Can you please tell me how to compare the distributions using KLD? – Smith Oct 16 '17 at 07:46
  • I would use a hierarchical model that defines the newspaper Dirichlet distributions based on the pooled topic model. You should be able to implement this in Edward; see [this GitHub discussion](https://github.com/blei-lab/edward/issues/473). Once you have the newspaper distributions, the difference can be defined by the KLD or the symmetric [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) as follows: http://bariskurt.com/kullback-leibler-divergence-between-two-dirichlet-and-beta-distributions/ Good luck! – Emre Oct 17 '17 at 04:30
  • @Emre Thanks a lot. Since I am not good at mathematics, I barely understand your second link. Is there any Python project or tutorial that you would recommend looking at? – Smith Oct 17 '17 at 23:29
  • [This](https://www.coursera.org/learn/text-mining/lecture/deiXc/3-9-latent-dirichlet-allocation-lda-part-1) lecture might help. Read about [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), [Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution), and [divergence](https://en.wikipedia.org/wiki/Divergence_(statistics)) measures. – Emre Oct 17 '17 at 23:59
  • @Emre Do we essentially need a probability distribution to use Kullback-Leibler divergence? What properties do the topics need to have in order to use it? :) – Smith Volka Oct 30 '17 at 02:06
  • You need two topics whose distributions have equal [support](https://en.wikipedia.org/wiki/Support_(mathematics)). – Emre Oct 30 '17 at 04:08
  • @Emre Can you please give me an example of two topic distributions with equal support? :) – Smith Volka Oct 31 '17 at 00:47
  • Two Dirichlet distributions with equal order. You'd be forming a mixture over the topics. You can learn more by searching for _hierarchical LDA/topic models_. – Emre Oct 31 '17 at 03:20
  • @Emre Thanks a lot! According to my current understanding, to apply KLD we need probability vectors whose elements sum to 1. Please correct me if I am wrong :) – Smith Volka Nov 01 '17 at 06:02
  • Or integrates to one, for continuous distributions. – Emre Nov 01 '17 at 07:11
  • @Emre If we are using word2vec embeddings for this, is it correct to normalize them and compare them with the other normalized word2vec embeddings? For example, if my word2vec vector is [0.3, 0.5, 0.7], I normalize it to [0.2, 0.3333, 0.4666] and compare it with other normalized word2vec embeddings? – Smith Volka Nov 01 '17 at 23:05
  • Don't confuse word embeddings with topics. – Emre Nov 02 '17 at 02:06
  • What you mean is that we can't use KLD for word embeddings? :) – Smith Volka Nov 02 '17 at 04:46
  • No, use the cosine similarity. – Emre Nov 02 '17 at 05:30
  • @Emre Thanks a lot! So, in summary, we can use KLD for probability distributions and cosine similarity for word embeddings. Please correct me if I am wrong :) – Smith Volka Nov 02 '17 at 14:31
  • Yes. I think you would benefit from studying linear algebra. – Emre Nov 02 '17 at 16:33
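
As a minimal illustration of the cosine-similarity approach suggested in the comments above: the embedding vectors below are made-up three-dimensional stand-ins for real word2vec vectors (which typically have 100-300 dimensions). Unlike KLD, cosine similarity does not require the vectors to be probability distributions, so no normalization to a sum of 1 is needed.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors; ranges from -1 to 1."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up word2vec embeddings for two words that should be related.
emb_nature = [0.3, 0.5, 0.7]
emb_animals = [0.2, 0.6, 0.6]

print(cosine_similarity(emb_nature, emb_animals))  # close to 1.0 -> similar words
```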

2 Answers


One method for comparing the topics across two corpora and measuring their similarity is the Kullback-Leibler (KL) divergence, also known as relative entropy. The KL divergence measures how one probability distribution diverges from a second, reference probability distribution.
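
A minimal sketch of that comparison, assuming the topics of the two corpora have already been matched and arranged in the same order (as the question describes doing manually); the weekly topic mixtures and the smoothing constant `eps` below are illustrative assumptions. The symmetric Jensen-Shannon divergence mentioned in the comments is included as well, since KLD is asymmetric and JS is bounded and often easier to interpret.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) between two discrete distributions over the same, aligned topics.
    A small epsilon avoids log(0) and division by zero for topics with zero weight."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()   # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Symmetric, bounded Jensen-Shannon divergence between p and q."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Illustrative weekly topic mixtures for the two newspapers (same topic order).
paper_a = [0.10, 0.80, 0.00, 0.10]
paper_b = [0.30, 0.40, 0.20, 0.10]

print(kl_divergence(paper_a, paper_b))  # 0.0 only when the mixtures are identical
print(js_divergence(paper_a, paper_b))
```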

Another, more scalable algorithm can be found in the paper "Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment".
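
The paper's method is more involved than this, but the core idea of topical alignment, matching topics from two separately trained models by comparing their topic-word distributions, can be sketched roughly as follows; the tiny topic-word matrices (over a shared vocabulary) are made-up placeholders, not output of any real model.

```python
import numpy as np

# Made-up topic-word distributions over a shared vocabulary
# (rows are topics, columns are words such as 'election', 'forest', 'match', 'market';
#  each row sums to 1).
topics_a = np.array([[0.7, 0.1, 0.1, 0.1],    # resembles a 'politics' topic
                     [0.1, 0.7, 0.1, 0.1]])   # resembles a 'nature' topic
topics_b = np.array([[0.1, 0.6, 0.2, 0.1],    # resembles an 'animals' topic
                     [0.6, 0.1, 0.1, 0.2]])   # resembles a 'people' topic

# Pairwise cosine similarity between every topic of model A and every topic of model B.
norm_a = topics_a / np.linalg.norm(topics_a, axis=1, keepdims=True)
norm_b = topics_b / np.linalg.norm(topics_b, axis=1, keepdims=True)
similarity = norm_a @ norm_b.T

# Greedy alignment: pair each topic of A with its most similar topic in B.
for i, row in enumerate(similarity):
    j = int(np.argmax(row))
    print(f"topic A{i} <-> topic B{j} (cosine similarity {row[j]:.2f})")
```

A full treatment would also need to handle topics that have no good counterpart in the other model, which is part of what the paper addresses.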

Brian Spiering
  • Thanks a lot for your great answer. Can you please elaborate on how to use your second option (http://vis.stanford.edu/topic-diagnostics)? It is not very clear to me how to apply it to my problem. – Smith Oct 21 '17 at 13:28

Considering that you extracted the topics using a TF-IDF approach, what you have as a result is just one feature (the frequency of terms). I think you would need to add more features to your corpus to be able to match two news articles as the same (or similar).

One new feature would be temporal: add a timestamp to each news article. This will allow you to check whether two articles (from different publishers) were published in the same period (a week, two weeks, etc.).
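
A small sketch of how the temporal feature could be used, assuming every article carries a publication timestamp and an aligned topic vector (the column names, dates, and values below are hypothetical): articles are bucketed into weekly periods, and only weeks that both papers cover are compared.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per article, with a timestamp and an aligned topic vector.
paper_a = pd.DataFrame({
    "published": pd.to_datetime(["2017-10-02", "2017-10-04", "2017-10-11"]),
    "topics": [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.0, 0.5, 0.5]],
})
paper_b = pd.DataFrame({
    "published": pd.to_datetime(["2017-10-03", "2017-10-10"]),
    "topics": [[0.2, 0.7, 0.1], [0.1, 0.4, 0.5]],
})

# Bucket articles into weekly periods so only news from the same week is compared.
for df in (paper_a, paper_b):
    df["week"] = df["published"].dt.to_period("W")

# Average topic mixture per week for each paper.
mix_a = paper_a.groupby("week")["topics"].apply(lambda v: np.vstack(v).mean(axis=0))
mix_b = paper_b.groupby("week")["topics"].apply(lambda v: np.vstack(v).mean(axis=0))

# Compare the weeks covered by both papers, here with cosine similarity.
for week in mix_a.index.intersection(mix_b.index):
    a, b = mix_a[week], mix_b[week]
    print(week, float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```

Thresholding these per-week similarities then gives a direct answer to the question of whether the two distributors publish the same news in a given week.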

The second one could be spatial: if you have the geolocation of the news, for example, you can add it to your training dataset. Something similar was done by Chi-Chun Pan. It would give you more confidence that two articles report events from the same place.

Elder Santos