Download documents from aspx webpage in R

I am trying to automatically download documents for oil and gas wells from the Colorado Oil and Gas Conservation Commission (COGCC) using the "rvest" and "downloader" packages in R.

Here is a link to the table/form that lists the documents for a particular well: http://ogccweblink.state.co.us/results.aspx?id=12337064

"id=12337064" is the unique identifier for the well.

Documents on the form page can be downloaded by clicking on them. For example: http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=3172781

"DocumentId=3172781" is the unique identifier for the uploaded document, in this case an xlsm file. Other file formats on the document page include PDF and xls.

So far, I have written code that can download any document for any well, but only from the first page. Most wells have documents on multiple pages, and I cannot download documents from pages other than page 1 (all document pages have a similar URL).

## Extract the document ID of the document to download, in this case "DIRECTIONAL DATA".
## The CSS path was found with the SelectorGadget tool.
library(rvest)
page <- read_html("http://ogccweblink.state.co.us/results.aspx?id=12337064")
File <- html_nodes(page, "tr:nth-child(24) td:nth-child(4) a")
## Read the link target rather than the whole tag, so digits in the link text cannot leak into the ID
File <- html_attr(File[[1]], "href")
DocId <- gsub('[^0-9]', '', File)
DocId
[1] "3172781"

## To download the document, I use the downloader package
library(downloader)
## Note the "?" that separates the page name from the query string
linkDocId <- paste('http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=', DocId, sep='')
download(linkDocId, "DIRECTIONAL DATA", mode='wb')

trying URL 'http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=3172781'
Content type 'application/octet-stream' length 33800 bytes (33 KB)
downloaded 33 KB

Does anyone know how I can change my code to download documents from the other pages?

Many thanks!

Em

1 answer

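You need to reuse the same cookie for the second request and also pass the VIEWSTATE and EVENTVALIDATION fields. A quick example:

  • Load RCurl and XML, request the URL, and store the cookie:

    url   <- 'http://ogccweblink.state.co.us/results.aspx?id=12337064'
    library(RCurl)
    library(XML)  # provides htmlTreeParse, xpathSApply and xmlGetAttr used below
    curl  <- curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = getCurlHandle())
    page1 <- getURL(url, curl = curl)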
    
  • Extract the VIEWSTATE and EVENTVALIDATION values from the HTML:

    page1 <- htmlTreeParse(page1, useInternal = TRUE)
    viewstate  <- xpathSApply(page1, '//input[@name = "__VIEWSTATE"]', xmlGetAttr, 'value')
    validation <- xpathSApply(page1, '//input[@name = "__EVENTVALIDATION"]', xmlGetAttr, 'value')
    
  • Request the same URL again with the saved cookie and the extracted hidden INPUT values, asking for the second page:

    page2 <- postForm(url, curl = curl,
             .params = list(
                 '__EVENTARGUMENT'   = 'Page$2',
                 '__EVENTTARGET'     = 'WQResultGridView',
                 '__VIEWSTATE'       = viewstate,
                 '__EVENTVALIDATION' = validation))
    
  • Extract the URLs from the table shown on the second page:

    page2 <- htmlTreeParse(page2, useInternal = TRUE)
    xpathSApply(page2, '//td/font/a', xmlGetAttr, 'href')
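
The steps above fetch only page 2. A minimal sketch of how they might be extended to several pages follows. It assumes RCurl and XML are installed, and that `url`, `curl`, `viewstate` and `validation` already exist as created above. The helper names (`doc_id_from_href`, `download_url`, `hidden_field`, `collect_links`) and the `n_pages` argument are hypothetical: the sketch does not detect the page count, so the caller has to supply it. It also assumes that each response carries fresh `__VIEWSTATE` and `__EVENTVALIDATION` values for the next postback, which has not been verified against this site.

```r
base_url <- 'http://ogccweblink.state.co.us/'

# Pull the numeric id out of an href such as
# 'DownloadDocument.aspx?DocumentId=3172781'.
doc_id_from_href <- function(href) {
  sub('.*DocumentId=([0-9]+).*', '\\1', href)
}

# Build the absolute download URL for one document id.
download_url <- function(doc_id) {
  paste0(base_url, 'DownloadDocument.aspx?DocumentId=', doc_id)
}

# Read one hidden form field (e.g. '__VIEWSTATE') from a parsed page.
hidden_field <- function(doc, name) {
  XML::xpathSApply(doc, sprintf('//input[@name = "%s"]', name),
                   XML::xmlGetAttr, 'value')
}

# Hypothetical helper: collect the document links from pages 2..n_pages.
# Page 1 links can be read from the initial response in the same way.
collect_links <- function(url, curl, viewstate, validation, n_pages) {
  links <- character(0)
  for (n in seq_len(n_pages)[-1]) {
    html <- RCurl::postForm(url, curl = curl,
                            .params = list(
                              '__EVENTARGUMENT'   = paste0('Page$', n),
                              '__EVENTTARGET'     = 'WQResultGridView',
                              '__VIEWSTATE'       = viewstate,
                              '__EVENTVALIDATION' = validation))
    doc <- XML::htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)
    links <- c(links,
               XML::xpathSApply(doc, '//td/font/a', XML::xmlGetAttr, 'href'))
    # Assumption: reuse the state returned by this page for the next request.
    viewstate  <- hidden_field(doc, '__VIEWSTATE')
    validation <- hidden_field(doc, '__EVENTVALIDATION')
  }
  unlist(links)
}

# Usage (not run; needs network access, and n_pages = 3 is only a placeholder):
# hrefs <- collect_links(url, curl, viewstate, validation, n_pages = 3)
# ids   <- doc_id_from_href(hrefs)
# for (id in ids) download.file(download_url(id), destfile = id, mode = 'wb')
```

The destination file name in the usage comment is just the document id. The real extension (xlsm, PDF, xls) is not known from the link alone, so it would have to be taken from the table row or the response headers.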
    

Source: https://habr.com/ru/post/1615678/

