
I have a dataset with many phrases that I would like to embed from scratch. I don't want to combine the vectors of the individual words to get a phrase embedding, because the phrases may appear in different environments; I want to embed the two (or three) words together, in their own context.

Is this possible?

If yes, how exactly?

Thank you in advance.

1 Answer


There are several ways word embeddings are trained, and most of them require a lot of data: they learn vector representations that are useful for some self-supervised objective, and those objectives tend to be data-hungry.

  • word2vec (and variants) learn representations by training a model to use those representations to predict adjacent words (a from-scratch sketch follows this list)
  • Approaches like ELMo and BERT use intermediate representations from a language model, which are pretrained on large text corpora (see the contextual sketch after the list)
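For the from-scratch route, a common recipe is to first merge frequent multiword phrases into single tokens and then train word2vec on the merged corpus, so each phrase gets its own vector learned from the contexts the whole phrase appears in. A minimal sketch, assuming gensim ≥ 4; the toy corpus and the min_count/threshold settings are placeholders you would tune on your real data:

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Toy corpus: in practice, stream your full dataset as lists of tokens.
sentences = [
    ["new", "york", "is", "a", "big", "city"],
    ["she", "moved", "to", "new", "york", "last", "year"],
    ["machine", "learning", "is", "fun"],
    ["he", "studies", "machine", "learning", "and", "statistics"],
]

# Detect frequent bigrams and merge them into single tokens like "new_york".
# min_count/threshold are this low only because the toy corpus is tiny.
bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
phrased = [bigram[s] for s in sentences]

# Train word2vec from scratch; merged phrases now get their own vectors.
model = Word2Vec(phrased, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["new_york"])  # the phrase embedded as a single unit
```

(Applying Phrases a second time over the bigrammed corpus merges trigrams like "new_york_city".)

If you want the phrase embedding to depend on the surrounding sentence, a model like BERT gives you that directly: run the sentence through the model and pool the hidden states of the tokens that make up the phrase. A sketch using the Hugging Face transformers library; bert-base-uncased and the hardcoded token positions are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "She moved to New York last year."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)

# Locate the tokens covering the phrase and mean-pool their hidden states.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
# tokens: ['[CLS]', 'she', 'moved', 'to', 'new', 'york', 'last', 'year', '.', '[SEP]']
phrase_vec = hidden[4:6].mean(dim=0)  # contextual embedding of "new york"
```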

If you have a large enough dataset you could train new domain-specific embeddings from scratch, but it would probably be more effective, and much easier, to fine-tune existing embeddings (i.e., initialize from pretrained models/representations and continue training on your domain data).
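To fine-tune with gensim specifically, one option is to load a previously saved full Word2Vec model (continued training needs the model's internal weights, not just exported vectors), add any new domain vocabulary, and keep training on your phrased corpus. A rough sketch; the file name and toy sentences are placeholders:

```python
from gensim.models import Word2Vec

# Load a full model saved earlier with model.save(); hypothetical path.
model = Word2Vec.load("pretrained_word2vec.model")

# In-domain data, tokenized (and phrase-merged) the same way as before.
domain_sentences = [
    ["deep", "vein", "thrombosis", "risk"],
    ["patient", "presented", "with", "deep", "vein", "thrombosis"],
]

# Register new vocabulary, then continue training on the domain data.
model.build_vocab(domain_sentences, update=True)
model.train(domain_sentences, total_examples=len(domain_sentences), epochs=10)
```

How far the vectors drift from the pretrained ones depends on the epochs and learning rate, so it is worth sanity-checking nearest neighbours before and after.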

See this post, for example, on finetuning word2vec.