Twitter data analysis?

Question

I am working on Twitter data analysis. I want to find trending topics in tweets with certain hashtags, such as #finance or #technology. I have a huge data set of tweets, and now I need to analyze them.

I need to recognize topics, if there are any. The way I'm approaching this is to first build a vector representation of each tweet with TF-IDF, and then group the tweets based on their cosine similarity.
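A minimal sketch of that approach, assuming scikit-learn is available. The toy tweets, the 0.3 threshold, and the group_by_similarity helper are made up for illustration and are not tuned for real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy tweets, invented for illustration only.
tweets = [
    "Stocks rally as markets cheer the rate cut #finance",
    "Markets rally again after the rate cut #finance",
    "New phone chip doubles battery life #technology",
    "Battery life doubles with the new phone chip #technology",
]

# One TF-IDF row per tweet; rows are L2-normalised by default.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(tweets)

# Pairwise cosine similarity, shape (n_tweets, n_tweets).
similarity = cosine_similarity(tfidf)


def group_by_similarity(sim, threshold=0.3):
    """Greedy grouping: a tweet joins the first group whose seed tweet it resembles."""
    groups = []
    for i in range(sim.shape[0]):
        for group in groups:
            if sim[i, group[0]] >= threshold:
                group.append(i)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([i])
    return groups


groups = group_by_similarity(similarity)
print(groups)  # [[0, 1], [2, 3]]
```

The greedy pass is only a stand-in for a real clustering step; it depends on tweet order and on the threshold.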

Are there common techniques for tweet analysis?

@Aditya Yes, I extended my question to explain what I'm doing now. Since I am new to this, I need to know if there are common techniques in Twitter data analysis for recognizing trending topics. – Federico Caccia, Apr 10 '18 at 19:08
Yes, I know there is a lot of noise out there, but I need to perform this study to complement other studies and then evaluate whether there is some data I can retrieve. – Federico Caccia, Apr 11 '18 at 19:44

score 3 · Accepted Answer · answered Apr 10 '18 at 20:49

I believe the algorithm you want is a Latent Dirichlet Allocation (LDA) model. This model is designed to uncover the topics in a corpus of documents.

scikit-learn has an implementation (sklearn.decomposition.LatentDirichletAllocation).

They even have a tutorial that teaches you how to extract topics. The tutorial also describes Non-negative Matrix Factorization (NMF) as a method to extract the topics. I can't vouch for this algorithm, because I haven't used it personally (unlike LDA, which I have used before), but judging from their tutorial, NMF does seem to give reasonable results.

Using cosine similarity will help you group the tweets that are most similar, but it won't give you their topics. That may be what you want; it is hard to say, because only you know how the system should behave. Unfortunately, none of this directly tells you what is trending, and you will need to do some heavy post-processing to make whatever algorithm you use produce something useful to you.

Good luck!

That tutorial is great! I'm going to adapt it with my own dataset. Thanks! — Federico Caccia, Apr 11 '18 at 14:07
It's a good start; however, these models need the number of topics in the dataset, which is unknown. – Federico Caccia, Apr 11 '18 at 14:27
You can use the log-likelihood (the score method of the LDA class) and a version of the elbow rule to determine the optimal number of topics. – Ryan, Apr 11 '18 at 17:17
OK, choosing the optimal number of topics based on some metric, like a coherence score, could be very useful. Thanks! – Federico Caccia, Apr 11 '18 at 18:25
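
The log-likelihood suggestion from the comments could be sketched roughly like this, assuming scikit-learn. The score_topic_counts helper, the toy documents, the candidate values, and the choice to score a held-out slice rather than the training data are all illustrative assumptions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def score_topic_counts(train_counts, heldout_counts, candidates, seed=0):
    """Return {k: approximate log-likelihood of the held-out docs under a k-topic LDA}."""
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(train_counts)
        # score() returns an approximate log-likelihood; higher is better.
        scores[k] = lda.score(heldout_counts)
    return scores


# Toy documents, invented for illustration only.
docs = [
    "stocks rally as markets cheer the rate cut",
    "new phone chip doubles battery life",
    "markets fall after the bank rate decision",
    "battery life improves with the new chip",
    "bank stocks rally after the rate decision",
    "phone battery chip gets a new design",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
train, heldout = counts[:4], counts[4:]

scores = score_topic_counts(train, heldout, candidates=[2, 3, 4])
for k, value in sorted(scores.items()):
    print(k, round(value, 2))
```

Plotting the scores against k and looking for the point where the curve flattens is one way to apply the elbow rule; on a corpus this small the numbers themselves carry no meaning.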

score 1 · Answer 2 · answered Apr 12 '18 at 18:39

As mentioned by @Ryan, LDA is one way to go, but I am not sure it will provide robust results on documents that are fundamentally limited to 140 characters in length. I tried it in the past on summaries of news articles and got mixed results. One alternative might be to test the performance of a supervised model such as an SVM or KNN, with the hashtags used as the classes.
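
A rough sketch of the hashtags-as-classes idea, assuming scikit-learn. The split_label helper, the toy tweets, the choice of the first hashtag as the label, and the use of LinearSVC are illustrative assumptions:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

HASHTAG = re.compile(r"#(\w+)")


def split_label(tweet):
    """Return (text without hashtags, first hashtag lowercased), or None if untagged."""
    tags = HASHTAG.findall(tweet)
    if not tags:
        return None
    text = HASHTAG.sub("", tweet).strip()
    return text, tags[0].lower()


# Toy tweets, invented for illustration only.
tweets = [
    "Stocks rally as markets cheer the rate cut #finance",
    "Markets fall after the bank decision #Finance",
    "New phone chip doubles battery life #technology",
    "Battery life improves with the new chip #technology",
    "No hashtag here, so this tweet is skipped",
]

pairs = [pair for pair in map(split_label, tweets) if pair is not None]
texts = [text for text, _ in pairs]
labels = [label for _, label in pairs]

# The hashtag is removed from the text so the model cannot simply read the label.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

prediction = model.predict(["markets react to the bank decision"])[0]
print(prediction)
```

On real data the model should be evaluated on held-out tweets, and tweets with several hashtags need a decision about which one counts as the class.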

As an aside, if you are committed to LDA, check out the gensim and pyLDAvis packages in Python.

Dan Temkin


Yes, those packages are great, but the real issue is that LDA is not giving me good results. The best approach I have found is, as you say, to take only the hashtags into account. – Federico Caccia Apr 12 '18 at 19:05
I was thinking more of using the hashtags as a built-in classification for the rest of the text in the tweet. One thing you can toy with is combining hashtags: set a "reasonable" distance under which two tags are considered equal, and use something basic like a string distance algorithm to measure the differences. This might generalize your results somewhat. Also consider using a stemmer, if you aren't already, to normalize the tags. – Dan Temkin Apr 12 '18 at 19:23
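
One possible sketch of the tag-merging idea, using only the Python standard library. The merge_similar_tags helper, the 0.8 threshold, and the use of difflib's similarity ratio in place of a true edit distance are illustrative assumptions:

```python
from difflib import SequenceMatcher


def merge_similar_tags(tags, threshold=0.8):
    """Map each tag to a canonical tag; the first tag seen in a cluster wins."""
    canonical = []
    mapping = {}
    for tag in tags:
        key = tag.lower().lstrip("#")
        for seen in canonical:
            # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
            if SequenceMatcher(None, key, seen).ratio() >= threshold:
                mapping[tag] = seen
                break
        else:
            # Nothing close enough yet, so this tag starts its own cluster.
            canonical.append(key)
            mapping[tag] = key
    return mapping


tags = ["#finance", "#Finance", "#finances", "#technology", "#tech"]
for tag, merged in merge_similar_tags(tags).items():
    print(tag, "->", merged)
```

With this threshold "#finances" folds into "finance", while "#tech" stays separate from "technology"; abbreviations like that would need a lower threshold, a stemmer, or a hand-made alias table.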
