I am experimenting with web scraping scripts in R. I have read many tutorials and docs and tried different things, but have not succeeded yet.
The site I'm trying to scrape is http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/. It is public government data, and the site has no statement against web scraping. It is in Portuguese, but I believe that will not be a big problem.
The page shows a search form with several fields. My test searched for data from one state ("RJ", in the "UF" field) and one city ("Rio de Janeiro", in the "MUNICIPIO" field). Clicking "Pesquisar" (Search) displays the matching results.

Using Firebug, I found the URL it calls with the options above. The state and city are passed as estadoSelect=33 and municipioSelect=3304557:
http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=33&municipioDecorate%3AmunicipioSelect=3304557&bairroDecorate%3AbairroInput=&pesquisar.x=42&pesquisar.y=16&javax.faces.ViewState=j_id10
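To make the query string easier to read, here is a minimal sketch that splits the same URL into its parameters with httr's parse_url(). The variable names (search_url, params, state, city) are only illustrative, and the URL is pasted together from pieces purely for readability. It does not contact the site.

```r
library(httr)

# Same search URL as above, split across lines only for readability.
search_url <- paste0(
  "http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam",
  "?buscaForm=buscaForm",
  "&codEntidadeDecorate%3AcodEntidadeInput=",
  "&noEntidadeDecorate%3AnoEntidadeInput=",
  "&descEnderecoDecorate%3AdescEnderecoInput=",
  "&estadoDecorate%3AestadoSelect=33",
  "&municipioDecorate%3AmunicipioSelect=3304557",
  "&bairroDecorate%3AbairroInput=",
  "&pesquisar.x=42&pesquisar.y=16",
  "&javax.faces.ViewState=j_id10"
)

# parse_url() only parses the string; no request is sent.
params <- parse_url(search_url)$query

# Pick out the state (UF) and city (MUNICIPIO) codes by parameter name.
state <- params[[grep("estadoSelect", names(params))]]
city  <- params[[grep("municipioSelect", names(params))]]

# The ViewState value is tied to the server-side session.
view_state <- params[["javax.faces.ViewState"]]

print(c(state = state, city = city, view_state = view_state))
```

So besides the form fields, the request also carries a javax.faces.ViewState value, which I assume is generated per session by the server.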
The site uses jsessionid, as can be seen from the following:
    library(rvest)
    library(httr)
    url <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/")
    cookies(url)
Knowing that it uses jsessionid, I ran cookies(url) to get the session id and put it into a new URL like this:
    url <- read_html("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=008142964577DBEC622E6D0C8AF2F034?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=33108064&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=org.jboss.seam.ui.NoSelectionConverter.noSelectionValue&bairroDecorate%3AbairroInput=&pesquisar.x=65&pesquisar.y=8&javax.faces.ViewState=j_id2")
    html_text(url)
Well, there is no data in the output. Instead, it contains an error message which, translated into English, basically says that the session has expired.
I assume I'm making some major mistake, but I have searched around and could not find a way to get past this.