Web scraping in R with a jsessionid

I am trying out web-scraping scripts in R. I have read many manuals and docs and tried different things, but have not yet succeeded.

The URL I am trying to scrape is http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam. It contains state government data and there is nothing prohibiting web scraping. It is in Portuguese, but I don't believe that will be a big problem.

The page shows a search form with several fields. My test searched for data from a particular state ("RJ", in the "UF" field) and city ("Rio de Janeiro", in the "MUNICIPIO" field). Clicking "Pesquisar" (Search) displays a table of results.


Using Firebug, I found the URL it calls (with the options above):

    http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=33&municipioDecorate%3AmunicipioSelect=3304557&bairroDecorate%3AbairroInput=&pesquisar.x=42&pesquisar.y=16&javax.faces.ViewState=j_id10
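
Decoded, that query string shows the individual JSF form fields more clearly, including the state (estadoDecorate:estadoSelect = 33) and municipality (municipioDecorate:municipioSelect = 3304557). A minimal sketch using httr::parse_url() to make the fields readable:

    library(httr)

    # decode the search URL so the individual form fields are readable
    u <- parse_url("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=33&municipioDecorate%3AmunicipioSelect=3304557&bairroDecorate%3AbairroInput=&pesquisar.x=42&pesquisar.y=16&javax.faces.ViewState=j_id10")

    # the query component is a named list; the estadoDecorate:estadoSelect entry
    # should hold "33" and municipioDecorate:municipioSelect "3304557"
    str(u$query)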

The site uses jsessionid, as can be seen from the following:

    library(rvest)
    library(httr)

    url <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/")
    cookies(url)

Knowing that it uses a jsessionid, I used cookies(url) to get that value and then used it in a new URL, like this:

    url <- read_html("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=008142964577DBEC622E6D0C8AF2F034?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=33108064&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=org.jboss.seam.ui.NoSelectionConverter.noSelectionValue&bairroDecorate%3AbairroInput=&pesquisar.x=65&pesquisar.y=8&javax.faces.ViewState=j_id2")
    html_text(url)

Well, there is no data in the output. Instead, it contains an error message which, translated into English, basically says that the session has expired.

I assume I am making a basic mistake here, but I have looked around and could not find a way to get past it.

1 answer

This combination worked for me:

    library(curl)
    library(xml2)
    library(httr)
    library(rvest)
    library(stringi)

    # warm up the curl handle
    start <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")

    # get the cookies
    ck <- handle_cookies(handle_find("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")$handle)

    # make the POST request
    res <- POST("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=" %s+% ck[1,]$value,
                user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:40.0) Gecko/20100101 Firefox/40.0"),
                accept("*/*"),
                encode="form",
                multipart=FALSE,   # this gens a warning but seems to be necessary
                add_headers(Referer="http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam"),
                body=list(`buscaForm`="buscaForm",
                          `codEntidadeDecorate:codEntidadeInput`="",
                          `noEntidadeDecorate:noEntidadeInput`="",
                          `descEnderecoDecorate:descEnderecoInput`="",
                          `estadoDecorate:estadoSelect`=33,
                          `municipioDecorate:municipioSelect`=3304557,
                          `bairroDecorate:bairroInput`="",
                          `pesquisar.x`=50,
                          `pesquisar.y`=15,
                          `javax.faces.ViewState`="j_id1"))

    doc <- read_html(content(res, as="text"))

    html_nodes(doc, "table")
    ## {xml_nodeset (5)}
    ## [1] <table border="0" cellpadding="0" cellspacing="0" class="rich-tabpanel " id="j_id17" sty ...
    ## [2] <table border="0" cellpadding="0" cellspacing="0">\n <tr>\n <td>\n <img alt="" ...
    ## [3] <table border="0" cellpadding="0" cellspacing="0" id="j_id18_shifted" onclick="if (RichF ...
    ## [4] <table border="0" cellpadding="0" cellspacing="0" style="height: 100%; width: 100%;">\n ...
    ## [5] <table border="0" cellpadding="10" cellspacing="0" class="dr-tbpnl-cntnt-pstn rich-tabpa ...

I used BurpSuite to watch what was happening, then did a quick check from the command line by taking its "Copy as cURL" output and adding --verbose so I could see exactly what was sent and received. Then I mimicked those curl parameters in the httr call.
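
An equivalent check can be done from R itself: httr's verbose() option prints the request and response headers, which makes it easy to compare what the script sends against what the browser (or BurpSuite) sends. A minimal sketch:

    library(httr)

    # print outgoing/incoming headers and connection info for comparison
    res <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam",
               verbose(info = TRUE))
    status_code(res)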

Because the flow starts from the plain search page, the cookies for the session id and the BigIP server are already primed (i.e. they will be sent with every subsequent request, so you do not need to manage them yourself), but the session id still has to go into the URL path itself, so we need to read it back out of the cookie jar and fill it in.
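
That step can be made a little more explicit by pulling the session id out of the cookie jar by name rather than by row position. A minimal sketch along the same lines as above, assuming the cookie is literally named JSESSIONID (which is what cookies(url) showed in the question):

    library(httr)

    base <- "http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam"

    # the first request primes the JSESSIONID and BIGip cookies on httr's handle
    start <- GET(base)

    # read the session id back out of the cookie jar by name
    ck <- cookies(start)
    jsessionid <- ck$value[ck$name == "JSESSIONID"]

    # the id still has to be embedded in the URL path used for the POST
    post_url <- paste0(base, ";jsessionid=", jsessionid)
    post_url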
