Problem Screen Scraping Google Data

Question

I'm trying to use rvest to screen scrape headline news items from Google, and it is failing.

I previously wrote a utility to screen scrape high-level stats from DS.SE (not user info, I have to say!), and it runs successfully, so I know that my technique works. This attempt, however, produces nothing.

I'm using SelectorGadget to find the page element needed (".r a"), and the code snippet below should read the headlines from the Google results page for the search term "education value".

My code (reduced to the bare bones) is:

library (rvest)

url = "https://www.google.co.uk/#q=%22education+value%22"
url

index_html <- read_html( url )

headline_tmp <- index_html %>%
  html_nodes (".r a") %>%
  html_text()
headline_tmp

This gives the following output:

> library (rvest)
> 
> url = "https://www.google.co.uk/#q=%22education+value%22"
> url
[1] "https://www.google.co.uk/#q=%22education+value%22"
> 
> index_html <- read_html( url )
> 
> headline_tmp <- index_html %>%
+   html_nodes (".r a") %>%
+   html_text()
> headline_tmp
character(0)

When I run suitably modified code against the Stack Exchange URL, it returns a vector of data.

My reading about Google blocking screen scrapers suggests that it only blocks people who abuse the service, and that it usually does so with a CAPTCHA or similar.

Any thoughts on a solution?
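
One thing that may be worth checking is the URL itself. Everything after a "#" in a URL is a fragment, which the client keeps to itself and does not send in the HTTP request, so read_html() would be asking for the Google home page rather than a results page. The sketch below uses base R only and makes no network calls; it just splits the URL to show which part is requested. The variable names (requested, fragment, search_url) are made up for illustration, and the "/search?q=" form at the end is an assumption: it puts the term in the query string, but whether the ".r a" selector matches the HTML returned for it is not verified here.

```r
# The URL from the question: the search term sits after "#".
url <- "https://www.google.co.uk/#q=%22education+value%22"

# Part that is actually sent to the server (everything before "#").
requested <- sub("#.*$", "", url)

# Fragment: kept client-side, normally acted on by JavaScript in a browser.
fragment <- sub("^[^#]*#?", "", url)

requested
fragment

# Assumption (hypothetical alternative, not verified against Google):
# carry the search term in the query string so it is part of the request.
search_url <- paste0(
  "https://www.google.co.uk/search?q=",
  utils::URLencode('"education value"', reserved = TRUE)
)
search_url
```

If requested comes back as just the home page address, that alone would explain an empty result from html_nodes(), independently of any anti-scraping measures.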
