With Gensim < 4.0, we can retrain a word2vec model using the following code:

    model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
    model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

However, as I understand it, Gensim 4.0 no longer supports Word2Vec.load_word2vec_format. Instead, I can only load the vectors as KeyedVectors.

How can I fine-tune a pre-trained word2vec model (such as the one trained on GoogleNews) on my domain-specific corpus using Gensim 4.0?

NST

1 Answer

You can try the following steps to fine-tune on your domain-specific corpus using Gensim 4.0:

  1. Create a Word2Vec model with the same vector size as the pretrained model

    w2vModel = Word2Vec(vector_size=..., min_count=..., ...)
    
  2. Build the vocabulary for the new corpus

    w2vModel.build_vocab(my_corpus)
    
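  3. Allocate a per-word array of lock factors, which controls how much each vector may change during training (1.0 means fully trainable, 0.0 means frozen). In previous Gensim versions, the single lockf argument to intersect_word2vec_format was enough; in Gensim 4.0, vectors_lockf defaults to a one-element array, so it has to be expanded to one entry per word before the intersection step can record per-word values. Use float32 to match the dtype of the word vectors

    w2vModel.wv.vectors_lockf = np.ones(len(w2vModel.wv), dtype=np.float32)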
    
  4. Perform a vocabulary intersection with intersect_word2vec_format to initialize the new embeddings with the pretrained embeddings for the words that appear in the pretrained vocabulary. Pass lockf=1.0 so that the imported vectors stay trainable; the default lockf=0.0 freezes them

    w2vModel.wv.intersect_word2vec_format('pretrained.bin', binary=True, lockf=1.0)
    

    The official Gensim documentation for intersect_word2vec_format (https://radimrehurek.com/gensim/models/keyedvectors.html) describes it as follows:

    Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.

    No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.

  5. Now, you can train the model on the new corpus

    w2vModel.train(my_corpus, total_examples=len(my_corpus), epochs=...)
    
liakoyras
Ishrak