For a university project, I chose to do sentiment analysis on a Google Play store reviews dataset. I obtained decent results classifying the data using the bag of words (BOW) model and an ADALINE classifier.

I would like to improve my model by adding bigrams that are relevant to the topic (negative or positive) to my feature set. I found this paper, which uses KL divergence to measure the relevance of unigrams/bigrams relative to a topic.

The only problem is that I am having trouble understanding what C refers to in equation (2.2). Does it refer to the unique words associated with topic C, the set of documents on a topic, or the words in a document?

Ethan
Balocre

1 Answer

Since those are academic researchers, they framed the problem in the most general way possible. The $C$ term could be any random variable to be modeled. In this specific case, $C$ is the individual tokens (unigrams or bigrams).
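
For reference, here is the general textbook definition of KL divergence between two discrete distributions $p$ and $q$. This is not a reproduction of the paper's equation (2.2); the symbols $x$, $\mathcal{X}$, $p$, and $q$ are my own notation. Under the reading above, $x$ would range over tokens, with $p$ estimated from one topic's text and $q$ from a background or contrasting corpus.

$$
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
$$

Each term of the sum is the contribution of a single token, which is why a per-token version of it can serve as a relevance score.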

I have found empirical improvement from including bigrams that rank highly as collocations, that is, frequently occurring n-grams. By including common phrases, a model can better capture how language is used in that specific context. Finding collocations is relatively straightforward: count the occurrences of all n-grams, rank them by frequency, then set a threshold to keep only the most common.
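
The counting-and-thresholding step can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `top_bigrams`, the parameters `min_count` and `top_k`, the whitespace tokenizer, and the sample reviews are all assumptions of mine, not something from the paper or your dataset.

```python
from collections import Counter


def top_bigrams(documents, min_count=2, top_k=10):
    """Rank bigrams by raw frequency and keep only the most common ones.

    `min_count` and `top_k` are illustrative thresholds, not tuned values.
    """
    counts = Counter()
    for text in documents:
        # Naive lowercase + whitespace tokenization (an assumption;
        # swap in whatever tokenizer the BOW pipeline already uses).
        tokens = text.lower().split()
        # Adjacent token pairs within one document are the bigrams.
        counts.update(zip(tokens, tokens[1:]))
    # Keep bigrams that occur at least `min_count` times, most frequent first.
    frequent = [(bg, n) for bg, n in counts.most_common() if n >= min_count]
    return frequent[:top_k]


# Made-up sample reviews, for illustration only.
reviews = [
    "great app works well",
    "does not work at all",
    "great app but does not work offline",
]

for bigram, n in top_bigrams(reviews):
    print(" ".join(bigram), n)
```

The surviving bigrams can then be appended to the unigram vocabulary as extra BOW features. Raw frequency is the simplest possible ranking; association measures such as pointwise mutual information are a common alternative when very frequent function-word pairs dominate the list.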

Those authors are looking for unique information, which is far more complex to model and often not necessary for model lift.

Brian Spiering