Detect named entities inside words

Question

Some languages attach endings directly to their nouns (Finnish, for example: "in Berlin" -> "Berliinissä"). I have tried annotating those characters in the training data as entities, but when I run the model, it does not detect them inside the word. They are detected only when they form a separate word. I cannot think of an implementation that effectively detects named entities within a word. Any suggestions would be helpful.

I was trying to detect "London" in "Londonschlüssel". But since my German is not very good, I only realized later that it would be more appropriate to write it as "London-Schlüssel", which can easily be tokenized. — Hasan Shaukat, Feb 22 '19 at 10:45

Tom · Answer 1 · 2019-02-22T15:54:11.693

I would recommend looking into character-level named entity recognition. For example: Kuru et al., CharNER: Character-Level Named Entity Recognition, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016).

The authors evaluate on several highly inflected languages, including Turkish, so the approach should be suitable for your Finnish use case.

The code is here: https://github.com/ozanarkancan/char-ner

You should hopefully be able to download it and get it running for training out of the box. Of course, I am assuming you have a tagged NER corpus in Finnish, which you would need to preprocess into the same format as the CSV file that they use for Czech in the repo.
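To make the idea of character-level labelling concrete, here is a minimal sketch in plain Python. It is not CharNER's code and it does not produce the repo's input format; the function names `char_labels` and `spans_from_labels`, the `"LOC"` tag, and the example sentence "Asun Berliinissä" ("I live in Berlin") are my own assumptions for illustration. It shows only how an entity span that ends in the middle of a word can be expressed as one label per character, and how such labels can be turned back into spans.

```python
def char_labels(text, spans, outside="O"):
    """Return one label per character of `text`.

    `spans` is a list of (start, end, label) tuples, `end` exclusive.
    Hypothetical helper for illustration; not part of any library.
    """
    labels = [outside] * len(text)
    for start, end, label in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end, label)}")
        labels[start:end] = [label] * (end - start)
    return labels


def spans_from_labels(labels, outside="O"):
    """Merge runs of identical non-outside labels into (start, end, label) spans.

    Note: two adjacent entities with the same label are merged into one span.
    """
    spans = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != outside:
                spans.append((start, i, labels[start]))
            start = i
    return spans


# Assumed example: the entity "Berliini" sits inside the inflected word "Berliinissä".
text = "Asun Berliinissä"
start = text.index("Berliini")
spans = [(start, start + len("Berliini"), "LOC")]

labels = char_labels(text, spans)
for ch, lab in zip(text, labels):
    print(repr(ch), lab)

# Round trip: the case ending "ssä" stays outside the recovered span.
print(spans_from_labels(labels))
```

Because every character gets its own label, the entity boundary no longer has to coincide with a token boundary, which is what goes wrong with word-level annotation in the question.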
