Fine-tuning pre-trained Word2Vec model with Gensim 4.0

Question

With Gensim < 4.0, we can retrain a word2vec model using the following code:

model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

However, as I understand it, Gensim 4.0 no longer supports Word2Vec.load_word2vec_format. Instead, I can only load the vectors as KeyedVectors.

How can I fine-tune a pre-trained word2vec model (such as the one trained on GoogleNews) on my domain-specific corpus using Gensim 4.0?

score 1 · Answer 1 · edited Jul 24 '23 at 15:48

You can try the following steps to fine-tune on your domain-specific corpus using Gensim 4.0:

Create a Word2Vec model with the same vector size as the pretrained model:
```
from gensim.models import Word2Vec
w2vModel = Word2Vec(vector_size=..., min_count=..., ...)
```
Build the vocabulary for the new corpus
```
w2vModel.build_vocab(my_corpus)
```
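Create an array of ones that sets the mutability (lock factor) of each word vector. In previous Gensim versions, passing the lockf argument to intersect_word2vec_format was enough. In Gensim 4.0 the per-word factors live in wv.vectors_lockf, and intersect_word2vec_format writes into that array, so it needs one entry per vocabulary word. A factor of 1.0 lets a vector be updated during fine-tuning and 0.0 freezes it. Use float32, the dtype Gensim uses for its vectors:
```
import numpy as np
w2vModel.wv.vectors_lockf = np.ones(len(w2vModel.wv), dtype=np.float32)
```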
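Perform a vocabulary intersection with intersect_word2vec_format to initialize the new embeddings with the pretrained embeddings for the words that appear in the pretrained vocabulary. Pass lockf=1.0 so the imported vectors stay trainable; according to the Gensim documentation, the default lockf=0.0 prevents them from being updated during subsequent training.
```
w2vModel.wv.intersect_word2vec_format('pretrained.bin', binary=True, lockf=1.0)
```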
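To make the role of the lock factor more concrete, here is a conceptual sketch in plain Python. It is not Gensim's training code: the helper name apply_update and the numbers are made up for illustration. It only assumes what the Gensim documentation states about lockf, namely that 0.0 blocks updates to a vector and 1.0 allows them in full.

```python
def apply_update(vector, gradient_step, lock_factor):
    """Return the vector after one update scaled by its lock factor.

    Hypothetical helper: it mimics the idea that each word's update is
    multiplied by that word's lock factor before it is applied.
    """
    return [v + lock_factor * g for v, g in zip(vector, gradient_step)]


vector = [0.5, -0.25]
step = [0.25, 0.5]

frozen = apply_update(vector, step, lock_factor=0.0)     # unchanged
trainable = apply_update(vector, step, lock_factor=1.0)  # fully updated
```

With a factor of 0.0 the pretrained vector would stay exactly as loaded, which is why the array of ones matters if you want fine-tuning to change it.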
The official Gensim documentation (https://radimrehurek.com/gensim/models/keyedvectors.html) describes intersect_word2vec_format as follows:

Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.

No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.
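The behaviour described in that quote can be illustrated with plain dictionaries. This is only a sketch of the semantics, not Gensim's implementation; the function intersect and the toy vectors are hypothetical.

```python
def intersect(current, pretrained):
    """Copy pretrained vectors into `current` for shared words only.

    Hypothetical helper working on plain dicts of word -> vector.
    Returns the number of overlapping words.
    """
    overlap = 0
    for word, vector in pretrained.items():
        if word in current:  # words missing from `current` are skipped
            current[word] = list(vector)
            overlap += 1
    return overlap


current = {"patient": [0.0, 0.0], "scalpel": [0.1, 0.2]}
pretrained = {"patient": [0.9, 0.8], "news": [0.3, 0.4]}

overlap = intersect(current, pretrained)
# "patient" now holds the pretrained vector, "scalpel" is left alone,
# and "news" was not added to the vocabulary.
```

In practice this means the final vocabulary is determined entirely by your own corpus and the min_count you chose; pretrained words that never occur in your corpus are not carried over.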

Now you can train the model on the new corpus:

w2vModel.train(my_corpus, total_examples=len(my_corpus), epochs=...)
