3

The sentencizers in spaCy and NLTK do not recognize typical abbreviations (e.g. "Mio." for "Million" in German), so the resulting sentence splits are incorrect. I understand that sentencizers are supposed to be simple and fast, but I am wondering if there is a better one that takes into account more than uppercased words and punctuation. Alternatively, how can I make the spaCy / NLTK / ... sentencizer work for such sentences?
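For example, with NLTK's default `sent_tokenize` (a minimal sketch; the exact split depends on which punkt model is loaded):

```python
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

text = "Der Umsatz betrug 3 Mio. Euro und stieg weiter."

# "Mio." is not a known abbreviation for the default punkt model, so the
# single sentence may be split after "Mio.", e.g.:
# ['Der Umsatz betrug 3 Mio.', 'Euro und stieg weiter.']
print(sent_tokenize(text))
```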

I am primarily interested in sentencizers with a Python API.

sophros

1 Answer

2

Neural tools trained on Universal Dependencies corpora use learned models for tokenization and sentence splitting. Two I know of are:

  • UDPipe – developed at Charles University in Prague. It gets very good results (at least for parsing), but its API is a little unintuitive.

  • Stanza – developed at Stanford University. Its API is quite similar to spaCy's; see the sketch below.

However, they are quite slow compared to regex-based sentence splitting.
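For a sense of the API, here is a minimal sketch with Stanza (it assumes the German models have been downloaded; the example text is illustrative):

```python
import stanza

# One-time download of the German models.
stanza.download("de")

# Only tokenization / sentence splitting is needed here.
nlp = stanza.Pipeline("de", processors="tokenize")

doc = nlp("Der Umsatz betrug 3 Mio. Euro und stieg weiter. Das ist ein Rekord.")
for sentence in doc.sentences:
    print(sentence.text)
```

Because the splitter is a learned model rather than a punctuation rule, it should keep "3 Mio. Euro" inside one sentence, at the cost of the slower runtime mentioned above.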

Jindřich