2

I am working on Twitter data analysis. I want to find trending topics in tweets that carry certain hashtags, like #finance or #technology. I have a huge data set of tweets and now I need to analyze them.

I need to recognize topics, if there are any. The way I'm approaching this is to first build a vector representation of each tweet with tf-idf, and then group the tweets based on their cosine similarity.
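The approach described above (tf-idf vectors, then grouping by cosine similarity) can be sketched with the standard library alone. This is only a minimal illustration under assumptions: the tokenizer, the smoothed idf formula, the greedy grouping strategy, the `threshold` value, and the sample tweets are all hypothetical choices, not something specified in the question.

```python
import math
import re
from collections import Counter

def tokenize(tweet):
    # Lowercase and keep words and hashtags; a deliberately crude tokenizer.
    return re.findall(r"#?\w+", tweet.lower())

def tfidf_vectors(tweets):
    docs = [Counter(tokenize(t)) for t in tweets]
    n = len(docs)
    # Document frequency: number of tweets that contain each term.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed idf (an assumed variant), so no term gets a zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def group_by_similarity(tweets, threshold=0.3):
    # Greedy single pass: a tweet joins the first group whose seed tweet
    # is similar enough, otherwise it starts a new group.
    vectors = tfidf_vectors(tweets)
    groups = []  # lists of tweet indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Hypothetical sample data.
tweets = [
    "Stocks rally as markets open #finance",
    "Markets rally on strong earnings #finance",
    "New phone launch announced today #technology",
    "Phone makers announce new launch #technology",
]
groups = group_by_similarity(tweets)
print(groups)
```

On a large data set a proper clustering algorithm would probably replace the greedy pass, but the vector and similarity steps stay the same.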

Are there common techniques for tweet analysis?

Federico Caccia

2 Answers

3

I believe the algorithm you want is a latent Dirichlet allocation (LDA) model, which is designed to uncover the topics in a corpus of documents.

scikit-learn has an implementation.

They even have a tutorial that shows how to extract topics. The tutorial also describes non-negative matrix factorization (NMF) as a method for extracting topics. I can't vouch for that algorithm, because I haven't used it personally (unlike LDA, which I have used before), but judging from the tutorial, NMF does seem to give reasonable results.

Cosine similarity will help you group the tweets that are most similar to each other, but it won't tell you what their topics are. That may still be what you want; it is hard to say, because only you know how the system should behave. Unfortunately, neither approach tells you what is trending on its own, and you will need some heavy post-processing to make whichever algorithm you use produce something useful to you.

Good luck!

Ryan
1

As @Ryan mentioned, LDA is one way to go, but I am not sure it will give robust results on documents that are fundamentally limited to 140 characters in length. I tried it in the past on summaries of news articles and got mixed results. One alternative might be to test the performance of a supervised model such as an SVM or KNN, with the hashtags used as the classes.
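A minimal sketch of the supervised idea, assuming scikit-learn and a linear SVM: the hashtag is stripped from each tweet and used as the class label, and the model is trained on the remaining text. The helper `split_tweet`, the rule of taking only the first hashtag, and the sample tweets are hypothetical choices for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

HASHTAG = re.compile(r"#\w+")

def split_tweet(tweet):
    """Return (text without hashtags, first hashtag lowercased or None)."""
    tags = HASHTAG.findall(tweet)
    text = HASHTAG.sub(" ", tweet).strip()
    return text, (tags[0].lower() if tags else None)

# Hypothetical sample data.
tweets = [
    "Stocks rally as markets open #finance",
    "Central bank raises interest rates #finance",
    "Markets fall on weak earnings #finance",
    "New phone launch announced today #technology",
    "Chip makers announce faster processors #technology",
    "Startup releases open source software #technology",
]

# Keep only tweets that carry a hashtag; the hashtag becomes the label.
pairs = [split_tweet(t) for t in tweets]
texts = [text for text, tag in pairs if tag]
labels = [tag for text, tag in pairs if tag]

# tf-idf features feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

predicted = model.predict(["Bank earnings lift markets"])
print(predicted[0])
```

Held-out accuracy (for example via `sklearn.model_selection.cross_val_score`) would then indicate whether the tweet text actually predicts its hashtag.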

As an aside, if you are committed to LDA, check out the gensim and pyLDAvis packages in Python.

Dan Temkin
  • Yes, those packages are great, but the real problem is that LDA is not giving me good results. The best approach I have found is, as you say, to take only the hashtags into account. – Federico Caccia Apr 12 '18 at 19:05
  • I was thinking more of using the hashtags as a built-in classification for the rest of the text in the tweet. One thing you can toy with is combining hashtags: set a "reasonable" difference under which two tags are considered equal, and use something basic like a string distance algorithm to measure the differences. This might generalize your results some more. Also consider using a stemmer, if you aren't already, to regularize the tags. – Dan Temkin Apr 12 '18 at 19:23
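The hashtag merging suggested in the last comment could look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the standard library as the string similarity measure; the 0.8 threshold, the greedy matching order, and the sample tags are assumptions. No stemmer is applied here; a real one (for example NLTK's `PorterStemmer`) could be added inside `normalize`.

```python
from difflib import SequenceMatcher

def normalize(tag):
    # Drop the leading '#' and lowercase; a stemmer could be applied here too.
    return tag.lstrip("#").lower()

def similar(a, b, threshold=0.8):
    # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def merge_hashtags(tags, threshold=0.8):
    """Map each raw tag to a representative; near-duplicates share one."""
    representatives = []
    mapping = {}
    for tag in tags:
        norm = normalize(tag)
        for rep in representatives:
            if similar(norm, rep, threshold):
                mapping[tag] = rep
                break
        else:
            # No close match so far: this tag starts its own group.
            representatives.append(norm)
            mapping[tag] = norm
    return mapping

# Hypothetical sample data.
tags = ["#Finance", "#finance", "#finances", "#technology", "#Technologies", "#tech"]
mapping = merge_hashtags(tags)
for raw, rep in mapping.items():
    print(f"{raw} -> {rep}")
```

With this threshold, plural variants collapse into one tag while a short form like `#tech` stays separate; lowering the threshold merges more aggressively at the risk of joining unrelated tags.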