
I'm using an unlabeled news corpus to fine-tune a multilingual BERT model. After that, I want to use the fine-tuned model to generate embeddings for the words present in a new labeled dataset; these embeddings will be fed to an RNN as initial weights. I want to save the embeddings of all words in the labeled dataset in a matrix, where the number of rows is the number of unique words in the labeled dataset and the number of columns is the dimension of the embedding vectors. How can I do that?

I've shared similar code below that generates the embedding matrix from a word2vec model:

    import numpy as np

    MAX_NB_WORDS = 200000
    embed_dim = embedding_size                      # dimension of the word2vec vectors
    words_not_found = []
    nb_words = min(MAX_NB_WORDS, len(word_index))
    # nb_words + 1 rows: one per unique word in the labeled data
    embedding_matrix = np.random.rand(nb_words + 1, embed_dim)

    # word_index maps each token in the labeled data to its integer index
    for word, i in word_index.items():
        if i >= nb_words:
            continue
        # embeddings_index is the word2vec model trained on the unlabeled data
        if word in embeddings_index.wv:
            embedding_matrix[i] = embeddings_index.wv[word]
        else:
            words_not_found.append(word)

Please help me convert this code to work with a multilingual BERT model.
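
For reference, this is the direction I've been imagining for the BERT version. It is only an untested sketch: it assumes a Hugging Face `transformers` checkpoint that I fine-tuned and saved to `./mbert-finetuned` (the path is just a placeholder), and it averages BERT's subword embedding rows to get one static vector per word. `word_index` and `MAX_NB_WORDS` are the same as in the word2vec snippet above.

    import numpy as np
    import torch
    from transformers import AutoTokenizer, AutoModel

    # load the fine-tuned multilingual BERT (placeholder path)
    tokenizer = AutoTokenizer.from_pretrained("./mbert-finetuned")
    model = AutoModel.from_pretrained("./mbert-finetuned")
    model.eval()

    embed_dim = model.config.hidden_size                  # 768 for bert-base checkpoints
    nb_words = min(MAX_NB_WORDS, len(word_index))
    embedding_matrix = np.zeros((nb_words + 1, embed_dim))

    # BERT's input (sub)word embedding table: one row per subword in its vocabulary
    subword_embeddings = model.get_input_embeddings().weight.detach()

    for word, i in word_index.items():
        if i >= nb_words:
            continue
        # a word from the labeled data may be split into several subwords;
        # average their embedding rows to get a single static vector per word
        subword_ids = tokenizer.encode(word, add_special_tokens=False)
        if subword_ids:
            embedding_matrix[i] = subword_embeddings[subword_ids].mean(dim=0).numpy()

Because of the subword splitting, no word should end up in a `words_not_found` list, but I don't know whether averaging the embedding-table rows is a sensible replacement for word2vec vectors.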

EDIT

After going through @noe's comment, I'm not sure this can be achieved at all, so I have reframed my question. An answer to either question will help me. The new question is given below.

I'm using an unlabeled news corpus to fine-tune a multilingual BERT model. After that, I want to use the fine-tuned model to generate embeddings for the words present in a new labeled dataset. These embeddings will be fed to an RNN. How can I achieve that?

I'm just a rookie, so please share some code snippets.
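
To make the new question more concrete, here is roughly the pipeline I'm imagining. Again, this is only an untested sketch: it assumes a Hugging Face `transformers` checkpoint saved at `./mbert-finetuned` (placeholder path), a GRU as the RNN, and a made-up `num_classes`; the BERT outputs are contextual subword vectors rather than word vectors.

    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("./mbert-finetuned")   # placeholder path
    bert = AutoModel.from_pretrained("./mbert-finetuned")
    bert.eval()

    num_classes = 3                                                  # made-up value
    rnn = nn.GRU(input_size=bert.config.hidden_size, hidden_size=128, batch_first=True)
    classifier = nn.Linear(128, num_classes)

    sentences = ["first labeled example", "another labeled example"]
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():                              # use BERT as a frozen feature extractor
        contextual = bert(**batch).last_hidden_state   # (batch, seq_len, hidden_size)

    rnn_out, _ = rnn(contextual)                       # (batch, seq_len, 128)
    logits = classifier(rnn_out[:, -1, :])             # last time step per sentence (ignores padding; just a sketch)

Is this the right way to connect a fine-tuned BERT to an RNN, or should I still build a word-level embedding matrix first?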

Debbie
    Note that, unlike word2vec, BERT does not give you word vectors but **subword** vectors. You can check [this answer](https://datascience.stackexchange.com/a/85570/14675) and [this other answer](https://datascience.stackexchange.com/a/85524/14675) and [yet another one](https://datascience.stackexchange.com/a/102110/14675) for an explanation of the difference. – noe May 03 '23 at 18:46
  • @noe thanks a lot. I understood the difference and edited my question. Could you please help me by sharing some code snippets? – Debbie May 03 '23 at 19:04
  • Also, consider that, unlike word2vec which is context-agnostic, the benefit of BERT is generating contextual representations. That is, the representations generated for a token in sentence A are different from those for the same token in sentence B. How are you planning to take that into account in your setting? – noe May 04 '23 at 06:01
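
A quick way to see both points from the comments above (subword splitting and context-dependent vectors), sketched here with the stock `bert-base-multilingual-cased` checkpoint from `transformers` (any fine-tuned mBERT would behave the same way):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")
    model.eval()

    # 1) BERT operates on subwords: a single word may be split into several pieces.
    print(tokenizer.tokenize("embeddings"))

    # 2) The vectors are contextual: the same word gets a different representation
    #    in different sentences.
    def vector_for(sentence, word):
        enc = tokenizer(sentence, return_tensors="pt")
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        pos = tokens.index(tokenizer.tokenize(word)[0])  # position of the word's first subword
        with torch.no_grad():
            return model(**enc).last_hidden_state[0, pos]

    v1 = vector_for("I deposited money at the bank.", "bank")
    v2 = vector_for("We walked along the river bank.", "bank")
    print(torch.cosine_similarity(v1, v2, dim=0))        # below 1.0: different vectors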

0 Answers