I have a large dataset of tweets that include hashtags, and I want to build a hashtag-based similarity engine that returns the most similar tweets given a set of hashtags.
In the end I would like to have some kind of "hashtag to vector" embedding model (working like a language embedding model) which outputs a comparable vector of reasonable length for a set of input hashtags. Later I also want to train a classifier on those vectors.
One idea would be to fit a TF-IDF vectorizer, apply some dimensionality reduction, and then take the cosine/Jaccard similarity between a query vector and the tweet vectors (roughly as in the sketch below).
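This is a minimal sketch of what I have in mind, assuming the hashtags are already extracted per tweet as lists of strings; the example data, the identity analyzer, and the SVD dimension are just placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy data: each tweet is represented only by its hashtags.
tweets_hashtags = [
    ["#ai", "#machinelearning"],
    ["#ai", "#python"],
    ["#football", "#worldcup"],
]

# Identity analyzer: each hashtag is one token, no lowercasing/stemming.
vectorizer = TfidfVectorizer(analyzer=lambda tags: tags)
tfidf = vectorizer.fit_transform(tweets_hashtags)

# Reduce to a fixed-length dense vector (the dimension is a free parameter).
svd = TruncatedSVD(n_components=2, random_state=0)
tweet_vectors = svd.fit_transform(tfidf)

# Query: embed a new set of hashtags and rank tweets by cosine similarity.
query = svd.transform(vectorizer.transform([["#ai", "#python"]]))
scores = cosine_similarity(query, tweet_vectors)[0]
print(scores.argsort()[::-1])  # indices of the most similar tweets, best first
```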
However, this solution comes with some problems, and I feel there are better approaches. Do you have any other solutions or pretrained models you can recommend or that I should try? My requirements are:
- The model should not capture the hashtags' semantic meaning but only the statistical relation between identical hashtags (no preprocessing, stemming/lemmatization, grammar handling, ...)
- The weighting of individual hashtags matters: the more frequent a hashtag is, the less impact it should have on the vector
- The order of the tokens/hashtags does not matter