
I am using Huggingface transformers for NER, following this excellent guide: https://huggingface.co/blog/how-to-train.

My incoming text has already been split into words. When tokenizing during training/fine-tuning I can use tokenizer(text, is_split_into_words=True) to tokenize the pre-split text. However, I can't figure out how to do the same in a pipeline for predictions.
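For reference, this is roughly what that looks like at training time (a minimal sketch; tokenizer is assumed to be a fast tokenizer loaded elsewhere):

words = ["Here", "is", "a", "sentence"]
enc = tokenizer(words, is_split_into_words=True, truncation=True)

# word_ids() maps each subword token back to the index of its source
# word, which is what makes label alignment possible when fine-tuning.
print(enc.word_ids())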

For example, the following works (but requires incoming text to be a string):

from transformers import pipeline

s1 = "Here is a sentence"
p1 = pipeline("ner", model=model, tokenizer=tokenizer)
p1(s1)

But the following raises this error: Exception: Impossible to guess which tokenizer to use. Please provide a PreTrainedTokenizer class or a path/identifier to a pretrained tokenizer.

s2 = "Here is a sentence".split()
toks = tokenizer(s2, is_split_into_words=True)
p2 = pipeline("ner", model=model)
p2(toks)

I don't want to join the incoming words back into a single string because whitespace is significant in my use case, and post-processing the pipeline's outputs will be complicated if I just pass in one string rather than a list of words.

Any advice on how I can use the is_split_into_words=True functionality in the pipeline?

Alan Buxton

2 Answers


If you are not set on this particular model for NER, there are some models that work with multi-sentence texts straight away without any manual splitting:


evaluate/evaluator/token_classification.py in the HuggingFace evaluate library has this line:

        data = data.map(lambda x: {input_column: join_by.join(x[input_column])})

So even though the HuggingFace evaluator takes as input text that is carefully split into tokens for token classification tasks, under the covers it generates the model prediction with the same trick, turning the list of tokens back into a single string before passing it to the NER pipeline:

" ".join(alist)

Trail Map