
I am doing NER using a BERT model. I have encountered some words in my dataset that are not part of the BERT vocabulary, and I am getting an error while converting those words to IDs. Can someone help me with this?

Below is the code I am using for BERT.

```python
import pandas as pd

df = pd.read_csv("drive/My Drive/PA_AG_123records.csv", sep=",", encoding="latin1").fillna(method='ffill')

# Download the official BERT tokenization helper (notebook shell command)
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

import tensorflow_hub as hub
import tokenization

# Load the BERT layer from TF Hub and build a FullTokenizer from the
# vocab file and casing flag shipped with the module.
module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(module_url, trainable=True)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

tokens_list=['hrct',
 'heall',
 'government',
 'of',
 'hem',
 'snehal',
 'sarjerao',
 'nawale',
 '12',
 '12',
 '9999',
 'female',
 'mobile',
 'no',
 '1155812345',
 '3333',
 '3333',
 '3333',
 '41st',
 '3iteir',
 'fillow']

max_len = 25
text = tokens_list[:max_len - 2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
print("After adding flags [CLS] and [SEP]:")
print(input_sequence)


tokens = tokenizer.convert_tokens_to_ids(input_sequence)
print("tokens to ids")
print(tokens)
```
AMIT KUMAR
    BERT uses subword vocabularies, and normally have no out-of-vocabulary word problems (see [How pre-trained BERT model generates word embeddings for out of vocabulary words?](https://datascience.stackexchange.com/questions/85566/how-pre-trained-bert-model-generates-word-embeddings-for-out-of-vocabulary-words) ). – noe Mar 04 '21 at 08:31
  • No, I am still getting the error. – AMIT KUMAR Mar 04 '21 at 08:32
  • What specific errors are you getting? With what input words? – noe Mar 04 '21 at 08:32
  • The word is 'hrct', and the error is that it has no key value. – AMIT KUMAR Mar 04 '21 at 08:33
  • I fear I have many words which will not be in the vocabulary. – AMIT KUMAR Mar 04 '21 at 08:34
  • Given that BERT does not have OOV word problems with Latin script words, I think this may be related to the BERT implementation you are using or to how you are using it. If you copy here the code you are using, we may be able to help better. – noe Mar 04 '21 at 08:40
  • I am updating my question with the code – AMIT KUMAR Mar 04 '21 at 08:44
  • @noe I have updated the code , please have a look. – AMIT KUMAR Mar 04 '21 at 08:50
  • Please consider marking the answer as correct if deemed so. – noe Mar 04 '21 at 11:53

1 Answer


The problem is that you are not using BERT's tokenizer properly.

Instead of using BERT's tokenizer to actually tokenize the input text, you are splitting the text into tokens yourself in your `tokens_list`, and then requesting the tokenizer to give you the IDs of those tokens. However, if you provide tokens that are not part of BERT's subword vocabulary, the tokenizer will not be able to handle them.

You must not do this.
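To see why, here is a minimal sketch of the failure, assuming the `tokenizer` built in the question; in the official `tokenization.py`, `convert_tokens_to_ids` is a plain vocabulary lookup, so an unknown token such as 'hrct' raises a `KeyError` (which matches the "no key value" error reported in the comments):

```python
# Minimal sketch of the failure (assumes `tokenizer` from the question).
# convert_tokens_to_ids looks each token up directly in the vocab dict,
# so a token missing from the subword vocabulary raises a KeyError.
try:
    tokenizer.convert_tokens_to_ids(['hrct'])
except KeyError as err:
    print("Token not in vocabulary:", err)
```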

Instead, you should let the tokenizer tokenize the text and then ask for the token IDs, e.g.:

```python
tokens_list = tokenizer.tokenize('Where are you going?')
token_ids = tokenizer.convert_tokens_to_ids(tokens_list)
```

Remember, nevertheless, that BERT uses subword tokenization, so it will split the input text so that it can be represented with the subwords in its vocabulary.
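Applied to the question's data, a corrected end-to-end sketch might look like this; it assumes the `tokenizer` and `max_len` defined in the question, and the raw string here is just the first few words of the question's list joined back together (the exact subword splits, e.g. for 'hrct', depend on the loaded vocabulary):

```python
# Sketch of the corrected pipeline (assumes `tokenizer` and `max_len`
# from the question). Tokenize the raw text first so that every piece
# is guaranteed to be in the subword vocabulary.
raw_text = "hrct heall government of hem snehal sarjerao nawale"
tokens = tokenizer.tokenize(raw_text)            # 'hrct' is split into known subwords
tokens = tokens[:max_len - 2]                    # leave room for the special tokens
input_sequence = ["[CLS]"] + tokens + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(input_sequence)
input_ids += [0] * (max_len - len(input_ids))    # pad with [PAD] (id 0) up to max_len
print(input_sequence)
print(input_ids)
```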

noe