This is a broad question and may not have an answer, but I will try. I have been thinking about techniques for detecting the publication date of public data found in the wild on the internet. Leaving defamation concerns aside, assume the data we are crawling comes from trustworthy sources.

The data I am concerned with is text of some length, meaning passages of at least a page, so as to ignore short sentences that appear everywhere.

Take, for instance, a news article on a news media website: what techniques can you think of to "estimate" its first appearance date? One obvious solution is to look for a date on the page that is plausible given the time of the crawl. But what else can you think of? I personally can think of no other way, but I am still tempted to imagine others.

Excuse my curiosity; this does not come from a pressing business or scientific need, but it could have applications in fake news detection.

My last attempt would be to assume the crawler visits its target sites so frequently that any newly discovered data can be tagged with the exact time of crawling (excluding, of course, news already in the database or copied from somewhere else, detected using hashes, perceptual hashes, or something similar to check uniqueness).
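To make the crawl-time idea concrete, here is a minimal sketch, assuming exact-match hashing after light normalization. The names `fingerprint` and `FirstSeenIndex` are hypothetical, and a real system would probably need a similarity or perceptual hash to catch lightly edited copies. The stored time is only an upper bound on the true publication date.

```python
import hashlib
import re
from datetime import datetime, timezone


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FirstSeenIndex:
    """Maps a text fingerprint to the earliest crawl time at which it was seen."""

    def __init__(self):
        self._first_seen = {}

    def record(self, text, crawl_time=None):
        """Return the first-seen time for `text`, storing `crawl_time`
        when the text is new or when `crawl_time` is earlier than the stored one."""
        if crawl_time is None:
            crawl_time = datetime.now(timezone.utc)
        key = fingerprint(text)
        known = self._first_seen.get(key)
        if known is None or crawl_time < known:
            self._first_seen[key] = crawl_time
        return self._first_seen[key]
```

A copy of an already indexed article, crawled later on another site, would get the earlier timestamp back from `record` instead of a new one.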

bacloud14

1 Answer

Go digging through the HTML: you will find consistent tags, styles, and formats across multiple sites.

For example, in the BBC's HTML we can see a datetime/timestamp.
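As a rough illustration of that kind of digging, here is a minimal sketch using only the Python standard library. It collects values from `<time datetime="...">` elements and from a few common `<meta>` date fields, then returns the earliest one that parses as an ISO 8601 date. The set of meta keys and the names `DateCandidateParser` and `earliest_date` are assumptions for the example, not something taken from any particular site.

```python
from datetime import date, datetime
from html.parser import HTMLParser
from typing import Optional

# Assumed list of meta keys that often carry a publication date; sites vary.
META_KEYS = {"article:published_time", "og:published_time", "date", "dc.date", "pubdate"}


class DateCandidateParser(HTMLParser):
    """Collects raw date strings from <time datetime=...> and selected <meta> tags."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "time" and attrs.get("datetime"):
            self.candidates.append(attrs["datetime"])
        elif tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in META_KEYS and attrs.get("content"):
                self.candidates.append(attrs["content"])


def earliest_date(html: str) -> Optional[date]:
    """Return the earliest ISO 8601 date found in the markup, or None."""
    parser = DateCandidateParser()
    parser.feed(html)
    parsed = []
    for raw in parser.candidates:
        try:
            # Older Python versions do not accept a trailing "Z" in fromisoformat.
            parsed.append(datetime.fromisoformat(raw.strip().replace("Z", "+00:00")).date())
        except ValueError:
            continue  # skip values that are not ISO 8601
    return min(parsed) if parsed else None
```

Taking the earliest candidate is a design choice for estimating first appearance; pages that also show "last updated" times would otherwise skew the estimate later.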

As always, someone has already done this for you: the htmldate package, built on dateparser, which you can also use directly.

GooJ
    I am amused that somebody has done that, not because it is anything ingenious, but because I found no references in academia. Maybe that is because you can never be sure of such a thing, even approximately, but it is so practical. Such a nice package! – bacloud14 Sep 10 '22 at 21:48