How to scrape information revealed on click using R?

I am trying to scrape the phone number from this site: http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53 . The phone number can be scraped with the rvest package using the selector .\'id_raw\'\::nth-child(1) span+ div strong (suggested by selectorGadget, http://selectorgadget.com/ ).

The problem is that the information only becomes available after clicking on its mask. Somehow, I have to open a session, perform a click, and then scrape the information.

EDIT: By the way, in my opinion this is not a link. Look at the source. I am stuck because I am a regular R user, not a JavaScript programmer.

2 answers

You can capture the data embedded in the <li> tags that tell the onclick handler what to do, and simply get the data directly:

    library(httr)
    library(rvest)
    library(purrr)
    library(stringr)

    URL <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"

    pg <- read_html(URL)

    html_nodes(pg, "li.rel") %>%       # get the 'special' <li> tags
      html_attrs() %>%                 # extract all the attrs (they're non-standard)
      flatten_chr() %>%                # list to character vector
      keep(~grepl("rel \\{", .x)) %>%  # only want ones with 'hidden' secret data
      str_extract("(\\{.*\\})") %>%    # only get the data
      unique() %>%                     # there are duplicates
      map_df(function(x) {
        path <- str_match(x, "'path':'([[:alnum:]]+)'")[,2]  # extract out the path
        id <- str_match(x, "'id':'([[:alnum:]]+)'")[,2]      # extract out the id
        ajax <- sprintf("http://olx.pl/ajax/misc/contact/%s/%s/", path, id)  # make the AJAX/XHR URL
        value <- content(GET(ajax))$value  # get the data
        data.frame(path=path, id=id, value=value, stringsAsFactors=FALSE)  # make a data frame
      })

    ## Source: local data frame [3 x 3]
    ##
    ##           path    id       value
    ##          (chr) (chr)       (chr)
    ## 1        phone dX6wf 503 155 744
    ## 2        skype dX6wf    e.bobruk
    ## 3 communicator dX6wf     7686136
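
The string handling in that pipeline can be tried offline, without touching the site. Below is a minimal sketch in base R. It assumes a class attribute shaped like the one the regexes above target; the sample string is made up for illustration rather than copied from the page, and build_contact_url() is a hypothetical helper name, not part of any package.

```r
# Hypothetical sample of an attribute carrying the embedded data;
# the attribute on the real page may look different.
attr_value <- "link-phone rel {'path':'phone', 'id':'dX6wf', 'id_raw': '123'}"

# Hypothetical helper: pull 'path' and 'id' out of the embedded data and
# build the AJAX URL in the same form as the answer's sprintf() call.
build_contact_url <- function(x) {
  extract <- function(key) {
    pattern <- sprintf("'%s':'([[:alnum:]]+)'", key)
    m <- regmatches(x, regexec(pattern, x))[[1]]
    if (length(m) < 2) NA_character_ else m[2]  # m[1] is the full match
  }
  sprintf("http://olx.pl/ajax/misc/contact/%s/%s/", extract("path"), extract("id"))
}

contact_url <- build_contact_url(attr_value)
print(contact_url)
```

Because the pattern requires a quote straight after the key name, the 'id' lookup does not pick up the 'id_raw' entry by accident.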

Having done all that, I'm pretty disappointed that the site doesn't have better Terms of Service / Use. It's pretty obvious that they really don't want you scraping this data.


Here's a solution using RSelenium (see the RSelenium introduction) and phantomjs.

However, I'm not sure how usable it is, because it runs very slowly on my machine. I'm not an expert on phantomjs or Selenium, so I don't know where the speed could be improved; I'm still looking into it ...

Edit

I tried it again, and the speed now seems fine.

    library(RSelenium)
    library(rvest)

    ## Terminal command to start selenium (on ubuntu)
    ## cd ~/selenium && java -jar selenium-server-standalone-2.48.2.jar

    url <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"

    RSelenium::startServer()
    remDr <- remoteDriver(browserName = "phantomjs")
    remDr$open()
    remDr$navigate(url)

    # css <- ".cpointer:nth-child(1)"  ## couldn't get this to work
    xp <- "//div[@class='contactbox-indent rel brkword']"
    webElem <- remDr$findElement(using = 'xpath', xp)
    # webElem <- remDr$findElement(using = 'css selector', css)
    webElem$clickElement()

    ## the page source now includes the clicked element
    page_source <- remDr$getPageSource()[[1]]
    pos <- regexpr('class=\\"xx-large', page_source)

    ## you could write a more intelligent regex, but this works for now
    phone_number <- substr(page_source, pos + 11, pos + 21)
    phone_number
    # "503 155 744"

    # remDr$close()
    # remDr$closeServer()
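
As for the "more intelligent regex" mentioned in the comments, one option is to capture the text inside the element instead of relying on fixed character offsets. This is only a sketch in base R: the HTML fragment below is a hypothetical stand-in for what remDr$getPageSource()[[1]] returns, and the assumption that the number sits directly inside a tag whose class starts with xx-large comes from the regexpr() call above, not from inspecting the live page.

```r
# Hypothetical fragment standing in for remDr$getPageSource()[[1]];
# the real markup may differ.
page_source <- '<div class="contactbox"><strong class="xx-large">503 155 744</strong></div>'

# Match the class, skip to the end of the opening tag, then capture
# everything up to the next "<".
pattern <- 'class="xx-large[^>]*>([^<]+)<'
m <- regmatches(page_source, regexec(pattern, page_source))[[1]]

# m[1] is the full match, m[2] the captured group (if anything matched)
phone_number <- if (length(m) >= 2) trimws(m[2]) else NA_character_
print(phone_number)
```

Unlike the substr() version, this does not depend on the number having a fixed length, and it yields NA when the pattern is not found.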

Source: https://habr.com/ru/post/1242963/

