
I am using Huggingface transformers for NER, following this excellent guide: https://huggingface.co/blog/how-to-train.

My incoming text has already been split into words. When tokenizing during training/fine-tuning I can use tokenizer(text, is_split_into_words=True) to tokenize the pre-split text. However, I can't figure out how to do the same in a pipeline for predictions.
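For reference, this is roughly what that looks like at training time (a minimal sketch; tokenizer is assumed to be a fast tokenizer loaded elsewhere):

words = ["Here", "is", "a", "sentence"]
enc = tokenizer(words, is_split_into_words=True, truncation=True)

# word_ids() maps each subword token back to the index of its source
# word, which is what makes label alignment possible when fine-tuning.
print(enc.word_ids())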

For example, the following works (but requires incoming text to be a string):

from transformers import pipeline

s1 = "Here is a sentence"
p1 = pipeline("ner", model=model, tokenizer=tokenizer)
p1(s1)

But the following raises this error: Exception: Impossible to guess which tokenizer to use. Please provide a PreTrainedTokenizer class or a path/identifier to a pretrained tokenizer.

s2 = "Here is a sentence".split()
toks = tokenizer(s2, is_split_into_words=True)
p2 = pipeline("ner", model=model)
p2(toks)

I don't want to join the incoming words back into a single string because whitespace is significant in my use case, and post-processing the pipeline's outputs will be complicated if I just pass in one string rather than a list of words.

Any advice on how I can use the is_split_into_words=True functionality in the pipeline?

Alan Buxton

2 Answers


If you are not set on this particular model for NER, there are some models that work with multi-sentence texts straight away without any manual splitting:


evaluate/evaluator/token_classification.py in the HuggingFace evaluate library has this line:

        data = data.map(lambda x: {input_column: join_by.join(x[input_column])})

So even though the HuggingFace evaluator takes as input text that is carefully split into tokens for token classification tasks, under the covers it generates the model prediction with the same trick, turning the list of tokens back into a single string before passing it to the NER pipeline:

" ".join(alist)

Trail Map