
I have a large dataset of tweets with hashtags, and I want to build a hashtag-based similarity engine that returns the most similar tweets for a given set of hashtags.

Ultimately I would like some kind of "hashtag-to-vector embedding" model (it should work like a language embedding model) that outputs a comparable vector of reasonable length for a set of input hashtags. Later I also want to train a classifier on those vectors.

One idea would be to fit a TF-IDF vectorizer, apply some dimensionality reduction, and then compute the cosine or Jaccard similarity between a query vector and the tweet vectors.
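To make that idea concrete, here is a minimal sketch of what I have in mind, assuming scikit-learn. The toy `tweets` data, the `most_similar` helper, and the choice of `TruncatedSVD` with 3 components are my own illustrative assumptions, not a tuned setup:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: each tweet is reduced to its list of hashtags.
tweets = [
    ["#python", "#ml", "#data"],
    ["#ml", "#data", "#ai"],
    ["#football", "#sports"],
    ["#sports", "#football", "#goal"],
    ["#python", "#code"],
]

# A callable analyzer bypasses tokenization, lowercasing, and any other
# preprocessing: every hashtag string is used as a token exactly as given.
vectorizer = TfidfVectorizer(analyzer=lambda tags: tags)
X = vectorizer.fit_transform(tweets)  # sparse matrix, one row per tweet

# Dimensionality reduction; n_components=3 is an arbitrary choice for the toy data.
svd = TruncatedSVD(n_components=3, random_state=0)
X_red = svd.fit_transform(X)


def most_similar(query_tags, top_k=2):
    """Return (tweet_index, cosine_similarity) pairs for the closest tweets."""
    q = svd.transform(vectorizer.transform([query_tags]))
    scores = cosine_similarity(q, X_red)[0]
    order = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]


print(most_similar(["#football", "#goal"]))
```

The IDF part gives frequent hashtags a lower weight, and since TF-IDF is a bag-of-tokens representation, the order of the hashtags has no effect on the vector.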

However, this approach has some problems, and I suspect there are better ways to do it. Are there other approaches or pretrained models that you would recommend or that I should try? My requirements:

  • The model should not capture the semantic meaning of a hashtag; it should only capture the statistical relations between identical hashtags (no preprocessing, stemming/lemmatization, grammar, ...)
  • The weights of individual hashtags matter: the more frequent a hashtag is, the less impact it should have on the vector
  • The order of the tokens/hashtags does not matter
Michael S
