8

I would like to extract all date information from a given document. Essentially, I guess this can be done with a lot of regexes:

  • 2019-02-20
  • 20.02.2019 ("German format")
  • 02/2019 ("February 2019")
  • "tomorrow" (datetime.timedelta(days=1))
  • "yesterday" (datetime.timedelta(days=-1))

Is there a Python package / library which offers this already, or do I have to write all of those regexes / that logic myself?

I'm interested in Information Extraction from German and English texts. Mainly German, though.
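Before reaching for a library, the bullet formats above can be covered with a handful of handwritten patterns. A minimal stdlib sketch (my own code, not an existing package; the keyword table and patterns are illustrative):

```python
import datetime
import re

# Relative-date keywords (German and English) mapped to day offsets.
RELATIVE = {
    "tomorrow": 1, "morgen": 1,
    "yesterday": -1, "gestern": -1,
}

ISO = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")        # 2019-02-20
GERMAN = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")   # 20.02.2019
MONTH_YEAR = re.compile(r"\b(\d{2})/(\d{4})\b")         # 02/2019

def extract_dates(text, today=None):
    """Return all dates found in *text* as datetime.date objects."""
    today = today or datetime.date.today()
    found = []
    for y, m, d in ISO.findall(text):
        found.append(datetime.date(int(y), int(m), int(d)))
    for d, m, y in GERMAN.findall(text):
        found.append(datetime.date(int(y), int(m), int(d)))
    for m, y in MONTH_YEAR.findall(text):
        found.append(datetime.date(int(y), int(m), 1))  # day defaults to 1
    for word, days in RELATIVE.items():
        if re.search(r"\b" + word + r"\b", text, re.IGNORECASE):
            found.append(today + datetime.timedelta(days=days))
    return found
```

This obviously does not scale to every variant, which is why a ready-made package would be preferable.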

Constraints

I don't have the complete dataset yet, but I have some idea about it:

  • 10 years of interesting dates which could be in the dataset
  • I guess the interesting date types are: (1) 28.02.2019, (2) relative ones like "3 days ago" (3) 28/02/2019, (4) 02/28/2019 (5) 2019-02-28 (6) 2019/02/28 (7) 2019/28/02 (8) 28.2.2019 (9) 28.2 (10) ... -- all of which could have spaces in various places
  • I have millions of documents. Every document has around 20 sentences, I guess.
  • Most of the data is in German
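For the ambiguous variants in the list above (e.g. 28/02/2019 vs. 02/28/2019), one common approach is to try a prioritized list of `strptime` formats and keep the first that parses. A rough sketch (the ordering — day-first before month-first, since most data is German — is my assumption):

```python
import datetime
import re

# Candidate strptime formats, tried in priority order.
FORMATS = [
    "%d.%m.%Y", "%d/%m/%Y", "%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d.%m",
]

# Rough regex for candidate spans: digits joined by '.', '/', or '-'.
TOKEN = re.compile(r"\d[\d./-]{2,9}\d")

def parse_date(token, default_year=2019):
    """Try each format in priority order; return the first datetime.date, or None."""
    for fmt in FORMATS:
        try:
            parsed = datetime.datetime.strptime(token, fmt)
        except ValueError:
            continue
        if "%Y" not in fmt:  # e.g. "28.2" carries no year
            parsed = parsed.replace(year=default_year)
        return parsed.date()
    return None

def extract(text):
    return [d for d in (parse_date(t) for t in TOKEN.findall(text)) if d]
```

With a precompiled token regex this is cheap enough to run over millions of short documents; the expensive `strptime` trials only happen on the few candidate spans per document.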
Martin Thoma
  • 18,630
  • 31
  • 92
  • 167
  • I had looked into this about 6 months ago and could not find anything that works out of the box for both English and German. What seemed promising was using some [fuzzy matching](https://github.com/seatgeek/fuzzywuzzy), given you can make some half-decent assumptions about the possible formats, as in your examples. The same would go for a regex solution, I suppose. You could even combine the approaches. – n1k31t4 Feb 20 '19 at 09:54
  • fuzzywuzzy, to my knowledge, is a bad match, as it essentially uses the Levenshtein distance. For dates I need regexes ... Although I could list all reasonable dates (10 years = 3653 elements) and all formats I'm interested in (maybe 10), doing fuzzy matching for roughly 36'530 elements over millions of documents is not feasible. – Martin Thoma Feb 20 '19 at 13:36
  • I agree it isn't optimal, but using heuristic parameters could work fairly well (it did for me). You could brute force it as you suggest – you hadn't mentioned millions of documents. To be more specific: it is really the number of tokens that matters (how big is a document?). Perhaps you could update your question to include those additional computation considerations/constraints. – n1k31t4 Feb 20 '19 at 14:13

2 Answers

8

Stanford CoreNLP has a very good implementation of NER for date/time.

https://nlp.stanford.edu/software/sutime.html (demo: http://nlp.stanford.edu:8080/sutime/process)


Though it is written in Java, there are quite a few Python wrappers for this library, such as https://github.com/FraBle/python-sutime. A list of such wrappers: https://stanfordnlp.github.io/CoreNLP/other-languages.html

Shamit Verma
  • 2,239
  • 1
  • 8
  • 14
  • At least the web interface only offers English. – Martin Thoma Feb 20 '19 at 07:21
  • These languages are built-in: https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/time/rules . You can look for German rules (if someone has created them) or estimate the work required to write them based on the number of rules in the other language files. – Shamit Verma Feb 20 '19 at 08:32
2

Spacy (https://spacy.io) comes with both English and German language models.

According to the documentation, its NER works for both absolute and relative dates: https://spacy.io/usage/linguistic-features#section-named-entities
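Besides the pretrained statistical models, spaCy also lets you attach rule-based date patterns to a blank German pipeline via its `EntityRuler` (spaCy v3 API; the patterns below are illustrative, not an official spaCy date matcher):

```python
import spacy

# Blank German pipeline with a rule-based EntityRuler.
nlp = spacy.blank("de")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Absolute formats, matched per token via regex.
    {"label": "DATE", "pattern": [{"TEXT": {"REGEX": r"^\d{1,2}\.\d{1,2}\.\d{4}$"}}]},
    {"label": "DATE", "pattern": [{"TEXT": {"REGEX": r"^\d{4}-\d{2}-\d{2}$"}}]},
    # Relative keywords.
    {"label": "DATE", "pattern": [{"LOWER": {"IN": ["gestern", "morgen", "heute"]}}]},
])

doc = nlp("Das Treffen am 20.02.2019 wurde auf morgen verschoben.")
dates = [(ent.text, ent.label_) for ent in doc.ents]
```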

Louis T
  • 1,148
  • 8
  • 22