I'm trying to use rvest to scrape headline news items from Google, and failing.
I previously wrote a utility that scrapes high-level stats from DS.SE (not user info, I have to say!) and it runs successfully, so I know my technique works, but this code produces nothing.
I'm using SelectorGadget to find the screen element needed (".r a"), and the code snippet below should read the headlines from the Google results page for the search term "education value".
My code (reduced to the bare bones) is:
library (rvest)
url = "https://www.google.co.uk/#q=%22education+value%22"
url
index_html <- read_html( url )
headline_tmp <- index_html %>%
html_nodes (".r a") %>%
html_text()
headline_tmp
This gives the following output:
> library (rvest)
>
> url = "https://www.google.co.uk/#q=%22education+value%22"
> url
[1] "https://www.google.co.uk/#q=%22education+value%22"
>
> index_html <- read_html( url )
>
> headline_tmp <- index_html %>%
+ html_nodes (".r a") %>%
+ html_text()
> headline_tmp
character(0)
When I run the equivalent code against the Stack Exchange URL (suitably modified for the different page), I get a vector of data.
The reading I've done about Google stopping screen scrapers suggests that they only block people who abuse it, usually with a CAPTCHA or similar.
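One thing I have not ruled out is the URL itself. Everything after a "#" in a URL is a fragment, which is handled by the browser and is not sent to the server in the HTTP request, so read_html() may only ever be asking for the bare home page. The sketch below uses base R only and does no scraping; it just splits my URL to show which part would actually be requested. The search_url at the end is my assumption about an equivalent query-string form, and I have not checked whether ".r a" matches anything on the page it returns.

```r
# Diagnostic sketch (base R only, no network access).
url <- "https://www.google.co.uk/#q=%22education+value%22"

# Part of the URL that is sent to the server: everything before the first "#".
request_part <- sub("#.*$", "", url)

# Fragment: everything after the first "#"; it stays on the client side.
fragment_part <- sub("^[^#]*#", "", url)

request_part
fragment_part

# Assumption: the same search expressed as a query string ("?q=") instead of
# a fragment, so that the search term is part of the request itself.
search_term <- "\"education value\""
search_url <- paste0(
  "https://www.google.co.uk/search?q=",
  URLencode(search_term, reserved = TRUE)
)
search_url
```

If the fragment is the cause, passing search_url to read_html() in place of url would be the next thing to try.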
Any thoughts on a solution?