3

The sentencizers in spaCy and NLTK do not recognize typical abbreviations (e.g. "Mio." for "Million" in German), so the resulting sentence splits are incorrect. I understand that sentencizers are supposed to be simple and fast, but I am wondering if there is a better one that takes into account more than uppercased words and punctuation. Alternatively, how can I make the spaCy / NLTK / ... sentencizer work for such sentences?
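For example, with NLTK's default `sent_tokenize` (a minimal sketch; the exact split depends on which punkt model is loaded):

```python
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

text = "Der Umsatz betrug 3 Mio. Euro und stieg weiter."

# "Mio." is not a known abbreviation for the default punkt model, so the
# single sentence may be split after "Mio.", e.g.:
# ['Der Umsatz betrug 3 Mio.', 'Euro und stieg weiter.']
print(sent_tokenize(text))
```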

I am primarily interested in sentencizers with a Python API.

sophros

1 Answer

2

Neural tools trained on Universal Dependencies corpora use learned models for tokenization and sentence splitting. Two I know of are:

  • UDPipe – developed at Charles University in Prague. It gets very good results (at least for parsing), but its API is a little unintuitive.

  • Stanza – developed at Stanford University. Its API is quite similar to spaCy's; see the sketch below.

However, they are quite slow compared to regex-based sentence splitting.
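For a sense of the API, here is a minimal sketch with Stanza (it assumes the German models have been downloaded; the example text is illustrative):

```python
import stanza

# One-time download of the German models.
stanza.download("de")

# Only tokenization / sentence splitting is needed here.
nlp = stanza.Pipeline("de", processors="tokenize")

doc = nlp("Der Umsatz betrug 3 Mio. Euro und stieg weiter. Das ist ein Rekord.")
for sentence in doc.sentences:
    print(sentence.text)
```

Because the splitter is a learned model rather than a punctuation rule, it should keep "3 Mio. Euro" inside one sentence, at the cost of the slower runtime mentioned above.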

Jindřich