I am looking for a programmatic way to scrape all available files for a series of data files on archives.gov with R. archives.gov appears to rely on JavaScript. My goal is to capture the URL of each available file, as well as the file name.
The Home Mortgage Disclosure Data Files series contains 153 entries.
In the browser, I can click the "Export" button and get a CSV file with this structure:
first_exported_record <-
  structure(list(
    resultType = structure(1L, .Label = "fileUnit", class = "factor"),
    creators.0 = structure(1L, .Label = "Federal Reserve System. Board of Governors. Division of Consumer and Community Affairs. ca. 1981- (Most Recent)", class = "factor"),
    date = structure(1L, .Label = "1981 - 2013", class = "factor"),
    documentIndex = 1L,
    from.0 = structure(1L, .Label = "Series: Home Mortgage Disclosure Data Files, 1981 - 2013", class = "factor"),
    from.1 = structure(1L, .Label = "Record Group 82: Records of the Federal Reserve System, 1913 - 2003", class = "factor"),
    location.locationFacility1.0 = structure(1L, .Label = "National Archives at College Park - Electronic Records(RDE)", class = "factor"),
    location.locationFacility1.1 = structure(1L, .Label = "National Archives at College Park", class = "factor"),
    location.locationFacility1.2 = structure(1L, .Label = "8601 Adelphi Road", class = "factor"),
    location.locationFacility1.3 = structure(1L, .Label = "College Park, MD, 20740-6001", class = "factor"),
    location.locationFacility1.4 = structure(1L, .Label = "Phone: 301-837-0470", class = "factor"),
    location.locationFacility1.5 = structure(1L, .Label = "Fax: 301-837-3681", class = "factor"),
    location.locationFacility1.6 = structure(1L, .Label = "Email: cer@nara.gov ", class = "factor"),
    naId = 18491490L,
    title = structure(1L, .Label = "Non-restricted Ultimate Loan Application Register (LAR) Data, 2012", class = "factor"),
    url = structure(1L, .Label = "https://catalog.archives.gov/id/18491490", class = "factor")
  ),
  .Names = c("resultType", "creators.0", "date", "documentIndex", "from.0", "from.1", "location.locationFacility1.0", "location.locationFacility1.1", "location.locationFacility1.2", "location.locationFacility1.3", "location.locationFacility1.4", "location.locationFacility1.5", "location.locationFacility1.6", "naId", "title", "url"),
  class = "data.frame",
  row.names = c(NA, -1L))
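Presumably the Export button is backed by a JSON endpoint that could be queried directly. A minimal sketch of that idea, assuming a v1 API at catalog.archives.gov/api/v1 and a parent-series filter parameter (both the endpoint and the parameter name are guesses to be verified against the requests shown in the browser's developer-tools network tab; the series naId below is a placeholder):

library(httr)
library(jsonlite)

# assumed endpoint and parameter names -- copy the real request the
# catalog page makes from the browser's network tab before relying on this
res <- GET(
  "https://catalog.archives.gov/api/v1",
  query = list(
    "description.fileUnit.parentSeries.naId" = "0000000",  # placeholder series naId
    "rows" = 200  # more than the 153 entries in this series
  )
)
stop_for_status(res)
series_json <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(series_json, max.level = 2)  # locate the naId/title fields for each file unit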
Each of these 153 entries then has a file-unit page with several files available for download. For example, the first exported record points to:
https://catalog.archives.gov/id/18491490
But both of these pages appear to be JavaScript-rendered, so I'm not sure whether I need something like PhantomJS or Selenium, or whether there is some trick to extract the file listing with simpler tools like rvest?
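In case it helps frame the question, the Selenium route I could fall back on would look roughly like this (a sketch only: rsDriver starts a local Selenium server plus browser, Sys.sleep is a crude wait for the page to render, and the CSS selector is a placeholder to adapt after inspecting the rendered markup):

library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", verbose = FALSE)  # starts a local Selenium server + browser
remDr <- driver$client
remDr$navigate("https://catalog.archives.gov/id/18491490")
Sys.sleep(5)  # crude wait for the JavaScript to finish rendering

# hand the fully rendered DOM to rvest
page <- read_html(remDr$getPageSource()[[1]])

# placeholder selector: adjust after inspecting the markup around the download links
urls <- html_attr(html_nodes(page, "a[href*='catalogmedia']"), "href")

remDr$close()
driver$server$stop()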
Once I know every file URL, I can download the files without problems:
tf <- tempfile()
download.file(
  "https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/233_32LU_TSS.pdf?download=false",
  tf,
  mode = "wb"
)
and the display name of this file will be
"Technical Specifications Summary, 2012 Ultimate LAR."
thanks!
Update:
The specific question is: how do I programmatically get from the first link (the series identifier) to the titles and URLs of all files available for download within that series? I tried various rvest and httr commands but have nothing useful to show for it. :/ Thanks
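For concreteness, the kind of httr call I have been attempting looks like this, where the naIds query parameter is only my guess at how the catalog API addresses a single record (18491490 is the naId from the first exported record above):

library(httr)
library(jsonlite)

# assumed endpoint/parameter -- verify against the requests the page at
# catalog.archives.gov/id/18491490 actually makes while loading
res <- GET("https://catalog.archives.gov/api/v1", query = list(naIds = "18491490"))
stop_for_status(res)
unit_json <- fromJSON(content(res, as = "text", encoding = "UTF-8"))

# the digital-object list (file names + download URLs) should be nested
# somewhere in here; str() it to find the exact path
str(unit_json, max.level = 3)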