8

I would like to extract all date information from a given document. Essentially, I guess this can be done with a lot of regexes:

  • 2019-02-20
  • 20.02.2019 ("German format")
  • 02/2019 ("February 2019")
  • "tomorrow" (datetime.timedelta(days=1))
  • "yesterday" (datetime.timedelta(days=-1))

Is there a Python package / library which offers this already, or do I have to write all of those regexes / that logic myself?

I'm interested in Information Extraction from German and English texts. Mainly German, though.
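Before reaching for a library, the bullet formats above can be covered with a handful of handwritten patterns. A minimal stdlib sketch (my own code, not an existing package; the keyword table and patterns are illustrative):

```python
import datetime
import re

# Relative-date keywords (German and English) mapped to day offsets.
RELATIVE = {
    "tomorrow": 1, "morgen": 1,
    "yesterday": -1, "gestern": -1,
}

ISO = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")        # 2019-02-20
GERMAN = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")   # 20.02.2019
MONTH_YEAR = re.compile(r"\b(\d{2})/(\d{4})\b")         # 02/2019

def extract_dates(text, today=None):
    """Return all dates found in *text* as datetime.date objects."""
    today = today or datetime.date.today()
    found = []
    for y, m, d in ISO.findall(text):
        found.append(datetime.date(int(y), int(m), int(d)))
    for d, m, y in GERMAN.findall(text):
        found.append(datetime.date(int(y), int(m), int(d)))
    for m, y in MONTH_YEAR.findall(text):
        found.append(datetime.date(int(y), int(m), 1))  # day defaults to 1
    for word, days in RELATIVE.items():
        if re.search(r"\b" + word + r"\b", text, re.IGNORECASE):
            found.append(today + datetime.timedelta(days=days))
    return found
```

This obviously does not scale to every variant, which is why a ready-made package would be preferable.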

Constraints

I don't have the complete dataset yet, but I have some idea about it:

  • 10 years of interesting dates which could be in the dataset
  • I guess the interesting date types are: (1) 28.02.2019, (2) relative ones like "3 days ago" (3) 28/02/2019, (4) 02/28/2019 (5) 2019-02-28 (6) 2019/02/28 (7) 2019/28/02 (8) 28.2.2019 (9) 28.2 (10) ... -- all of which could have spaces in various places
  • I have millions of documents. Every document has around 20 sentences, I guess.
  • Most of the data is in German
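For the ambiguous variants in the list above (e.g. 28/02/2019 vs. 02/28/2019), one common approach is to try a prioritized list of `strptime` formats and keep the first that parses. A rough sketch (the ordering — day-first before month-first, since most data is German — is my assumption):

```python
import datetime
import re

# Candidate strptime formats, tried in priority order.
FORMATS = [
    "%d.%m.%Y", "%d/%m/%Y", "%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d.%m",
]

# Rough regex for candidate spans: digits joined by '.', '/', or '-'.
TOKEN = re.compile(r"\d[\d./-]{2,9}\d")

def parse_date(token, default_year=2019):
    """Try each format in priority order; return the first datetime.date, or None."""
    for fmt in FORMATS:
        try:
            parsed = datetime.datetime.strptime(token, fmt)
        except ValueError:
            continue
        if "%Y" not in fmt:  # e.g. "28.2" carries no year
            parsed = parsed.replace(year=default_year)
        return parsed.date()
    return None

def extract(text):
    return [d for d in (parse_date(t) for t in TOKEN.findall(text)) if d]
```

With a precompiled token regex this is cheap enough to run over millions of short documents; the expensive `strptime` trials only happen on the few candidate spans per document.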
Martin Thoma
  • 18,630
  • 31
  • 92
  • 167
  • I had looked into this about 6 months ago and could not find anything that works out of the box for both English and German. What seemed promising was using some [fuzzy matching](https://github.com/seatgeek/fuzzywuzzy), given you can make some half-decent assumptions about the possible formats, as in your examples. The same would go for a regex solution, I suppose. You could even combine the approaches. – n1k31t4 Feb 20 '19 at 09:54
  • fuzzywuzzy, to my knowledge, is a bad match, as it essentially uses the Levenshtein distance. For dates I need regexes ... Although I could list all reasonable dates (10 years = 3653 elements) and all formats I'm interested in (maybe 10), doing fuzzy matching for roughly 36'530 elements over millions of documents is not feasible. – Martin Thoma Feb 20 '19 at 13:36
  • I agree it isn't optimal, but using heuristic parameters could work fairly well (it did for me). You could brute force it as you suggest – you hadn't mentioned millions of documents. To be more specific: it is really the number of tokens that matters (how big is a document?). Perhaps you could update your question to include those additional computation considerations/constraints. – n1k31t4 Feb 20 '19 at 14:13

2 Answers

8

Stanford CoreNLP has a very good implementation of NER for date/time.

https://nlp.stanford.edu/software/sutime.html (demo: http://nlp.stanford.edu:8080/sutime/process)


Though it is written in Java, there are quite a few Python wrappers for this library, such as https://github.com/FraBle/python-sutime. A list of such wrappers: https://stanfordnlp.github.io/CoreNLP/other-languages.html

Shamit Verma
  • 2,239
  • 1
  • 8
  • 14
  • At least the web interface only offers English. – Martin Thoma Feb 20 '19 at 07:21
  • These languages are built-in: https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/time/rules . You can look for German rules (if someone has created them) or estimate the work required to write them based on the number of rules in the other language files. – Shamit Verma Feb 20 '19 at 08:32
2

Spacy (https://spacy.io) comes with both English and German language models.

According to the documentation, its NER works for both absolute and relative dates: https://spacy.io/usage/linguistic-features#section-named-entities
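Besides the pretrained statistical models, spaCy also lets you attach rule-based date patterns to a blank German pipeline via its `EntityRuler` (spaCy v3 API; the patterns below are illustrative, not an official spaCy date matcher):

```python
import spacy

# Blank German pipeline with a rule-based EntityRuler.
nlp = spacy.blank("de")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Absolute formats, matched per token via regex.
    {"label": "DATE", "pattern": [{"TEXT": {"REGEX": r"^\d{1,2}\.\d{1,2}\.\d{4}$"}}]},
    {"label": "DATE", "pattern": [{"TEXT": {"REGEX": r"^\d{4}-\d{2}-\d{2}$"}}]},
    # Relative keywords.
    {"label": "DATE", "pattern": [{"LOWER": {"IN": ["gestern", "morgen", "heute"]}}]},
])

doc = nlp("Das Treffen am 20.02.2019 wurde auf morgen verschoben.")
dates = [(ent.text, ent.label_) for ent in doc.ents]
```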

Louis T
  • 1,148
  • 8
  • 22