This is a broad question and may not have an answer, but I will try. I have been thinking about techniques for detecting the publication date of public data found in the wild on the internet. To set aside any defamation concerns, assume the data we are crawling comes from trustworthy sources.
The data I am concerned with is text of some length: paragraphs amounting to at least a page, so that short, omnipresent sentences can be ignored.
Take, for instance, a news article on a news media website: what techniques can you think of to "estimate" its first appearance date? One obvious solution would be to look for a date in the page that is plausible given the time of the crawl. But what else can you think of? I personally can think of no other way, but I'm still tempted to imagine other ways.
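To make the "obvious solution" concrete, here is a minimal sketch of what I have in mind. It assumes dates appear in the page text in ISO form (YYYY-MM-DD), and the names `plausible_dates` and `max_age_days` are made up for illustration; real pages would need more date formats and probably metadata parsing as well.

```python
import re
from datetime import date, timedelta

# Assumption: dates are written in ISO form, e.g. 2021-03-14.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def plausible_dates(page_text, crawl_date, max_age_days=365):
    """Return the dates found in page_text that lie within
    [crawl_date - max_age_days, crawl_date], sorted oldest first."""
    earliest = crawl_date - timedelta(days=max_age_days)
    found = set()
    for year, month, day in ISO_DATE.findall(page_text):
        try:
            candidate = date(int(year), int(month), int(day))
        except ValueError:
            continue  # looks like a date but is not a valid calendar day
        # Keep only dates that are not in the future and not too old.
        if earliest <= candidate <= crawl_date:
            found.add(candidate)
    return sorted(found)
```

The one-year window is an arbitrary choice. Which of the surviving candidates is the real publication date (the oldest, the newest, the one nearest the title) is still an open question here.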
Excuse my curiosity; this doesn't come from a pressing business or scientific need, but it could still have applications in the field of fake news detection.
My last idea would be to assume the crawler revisits its target sites so quickly that any newly discovered data can be tagged with the exact time of crawling (excluding, of course, news already in the database that was copied from somewhere else, detected using hashes, perceptual hashes, or something similar to check uniqueness).
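A rough sketch of that last idea, under the assumption that an exact hash of lightly normalized text is enough to recognize a copy. The names `fingerprint` and `FirstSeenIndex` are hypothetical, and an in-memory dict stands in for the database.

```python
import hashlib
import re
from datetime import datetime, timezone


def fingerprint(text):
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FirstSeenIndex:
    """Maps a text fingerprint to the earliest crawl time it was seen."""

    def __init__(self):
        self._first_seen = {}

    def record(self, text, crawl_time=None):
        """Register a crawled text and return its first-seen time."""
        if crawl_time is None:
            crawl_time = datetime.now(timezone.utc)
        key = fingerprint(text)
        known = self._first_seen.get(key)
        # New text, or an earlier sighting than the one stored so far.
        if known is None or crawl_time < known:
            self._first_seen[key] = crawl_time
        return self._first_seen[key]
```

An exact hash like this only catches near-verbatim copies. Texts that were edited or paraphrased would need a similarity-preserving fingerprint instead, which is not shown here.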
