How to scrape all files in a series of directories from the National Archives (archives.gov) using R

I am looking for a programmatic way to scrape all available files for a series of data files on archives.gov with R. archives.gov seems to use JavaScript. My goal is to capture the URL of each available file, as well as the file name.

The series of Home Mortgage Disclosure data files has 153 entries.

In the browser, I can click the "Export" button and get a CSV file with this structure:

first_exported_record <- structure(
  list(
    resultType = structure(1L, .Label = "fileUnit", class = "factor"),
    creators.0 = structure(1L, .Label = "Federal Reserve System. Board of Governors. Division of Consumer and Community Affairs. ca. 1981- (Most Recent)", class = "factor"),
    date = structure(1L, .Label = "1981 - 2013", class = "factor"),
    documentIndex = 1L,
    from.0 = structure(1L, .Label = "Series: Home Mortgage Disclosure Data Files, 1981 - 2013", class = "factor"),
    from.1 = structure(1L, .Label = "Record Group 82: Records of the Federal Reserve System, 1913 - 2003", class = "factor"),
    location.locationFacility1.0 = structure(1L, .Label = "National Archives at College Park - Electronic Records(RDE)", class = "factor"),
    location.locationFacility1.1 = structure(1L, .Label = "National Archives at College Park", class = "factor"),
    location.locationFacility1.2 = structure(1L, .Label = "8601 Adelphi Road", class = "factor"),
    location.locationFacility1.3 = structure(1L, .Label = "College Park, MD, 20740-6001", class = "factor"),
    location.locationFacility1.4 = structure(1L, .Label = "Phone: 301-837-0470", class = "factor"),
    location.locationFacility1.5 = structure(1L, .Label = "Fax: 301-837-3681", class = "factor"),
    location.locationFacility1.6 = structure(1L, .Label = "Email: cer@nara.gov ", class = "factor"),
    naId = 18491490L,
    title = structure(1L, .Label = "Non-restricted Ultimate Loan Application Register (LAR) Data, 2012", class = "factor"),
    url = structure(1L, .Label = "https://catalog.archives.gov/id/18491490", class = "factor")
  ),
  .Names = c("resultType", "creators.0", "date", "documentIndex", "from.0", "from.1",
             "location.locationFacility1.0", "location.locationFacility1.1",
             "location.locationFacility1.2", "location.locationFacility1.3",
             "location.locationFacility1.4", "location.locationFacility1.5",
             "location.locationFacility1.6", "naId", "title", "url"),
  class = "data.frame",
  row.names = c(NA, -1L)
)

And then each of these 153 entries has a file unit page with several files available for download. For example, the first exported record points to:

https://catalog.archives.gov/id/18491490

But both of these pages look JavaScript-rendered, so I'm not sure whether I need something like PhantomJS or Selenium, or whether there is some trick to export a directory with simpler tools like rvest?

Once I know every file URL, I can download them without problems:

 tf <- tempfile()
 download.file(
   "https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/233_32LU_TSS.pdf?download=false",
   tf,
   mode = 'wb'
 )

and the name of this file would be

 "Technical Specifications Summary, 2012 Ultimate LAR." 

thanks!

update:

The specific question is: how do I programmatically get from the first link (the series identifier) to the titles and URLs of all the files available for download within the series? I tried rvest and httr commands, but have nothing useful to show for it. :/ Thanks

4 answers

This answer comes from the people who wrote the API at the National Archives and Records Administration.

Hi Anthony

No need to scrape; the NARA catalog has an open API. If I understand correctly, you want to download all the media files (what our catalog calls "objects") in all the file units in the series "Home Mortgage Disclosure Data Files" (NAID 2456161).

The API does allow fielded search on any data field, and while there is no search parameter like "parentNaId", the best way to do this is a fielded query, i.e. return all records where the parent NAID is 2456161. If you pull up one of these file units by identifier (for example, https://catalog.archives.gov/api/v1?naIds=2580657 ), you can see that the field containing the parent series is called "description.fileUnit.parentSeries". So all of your file unit records and their objects will be at https://catalog.archives.gov/api/v1?description.fileUnit.parentSeries=2456161 . If you want to return only the objects without the file unit records, you can add the parameter "&type=object". Or, if you want the file unit metadata, you can instead limit the results with "type=description", since each file unit record also contains all the data for its child objects. It looks like there are over 1,000 results, so you will also need to use the "rows" parameter to request all the results in one query, or paginate with the "offset" parameter and smaller "rows" values, since the default response contains only the first 10 results.
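That pagination scheme can be sketched in R. The helper below is hypothetical and only builds the query URLs; the endpoint and field name are the ones given in this answer, and actually fetching a page (e.g. with jsonlite::fromJSON) requires network access and possibly an API key:

```r
# Build one page of the fielded NARA catalog query described above.
# parent_naid, rows, and offset map directly onto the API parameters.
nara_page_url <- function(parent_naid, rows = 200, offset = 0) {
  sprintf(
    "https://catalog.archives.gov/api/v1?description.fileUnit.parentSeries=%s&rows=%d&offset=%d",
    parent_naid, rows, offset
  )
}

# Third page of 200 results:
nara_page_url("2456161", rows = 200, offset = 400)

# To fetch and parse a page (requires network access):
# page <- jsonlite::fromJSON(nara_page_url("2456161"))
```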

In the object metadata you will find fields with the URLs you can use to download the media, as well as other metadata that may be of interest. For example, note that some of these objects are the electronic records themselves, i.e. the original archival records from the agencies, while others are technical documentation created by NARA. This is indicated in the "designation" field.
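Once the object metadata is parsed into a data frame, the two kinds of objects could be separated on that field. A minimal sketch with made-up rows (the designation values and column layout here are assumptions, not taken from the API):

```r
# Toy object metadata imitating the structure described above;
# the designation values are hypothetical.
objects <- data.frame(
  description = c("Technical Specifications Summary, 2012 Ultimate LAR.",
                  "Non-restricted Ultimate LAR Data File"),
  designation = c("Technical Documentation", "Electronic Records"),
  stringsAsFactors = FALSE
)

# Keep only the NARA-created technical documentation:
docs <- objects[objects$designation == "Technical Documentation", ]
```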

Let me know if you have any questions.

Thanks! Dominic


There is no need to load and parse the page, since the records are fetched with a simple Ajax request.

To find that request, watch the network traffic in your browser's devtools and pick the first request that returns JSON. Then use the jsonlite library to request the same URL from R; it will parse the result automatically.

To list all files (description + URL) for the 153 entries:

 library(jsonlite)

 options(timeout = 60000)  # increase timeout to 60 sec (default is 10 sec)

 json <- fromJSON("https://catalog.archives.gov/OpaAPI/iapi/v1?action=search&f.level=fileUnit&f.parentNaId=2456161&q=*:*&offset=0&rows=10000&tabType=all")
 ids <- json$opaResponse$results$result$naId

 for (id in ids) {  # each file unit id
   json <- fromJSON(sprintf("https://catalog.archives.gov/OpaAPI/iapi/v1/id/%s", id))
   records <- json$opaResponse$content$objects$objects$object
   for (r in 1:nrow(records)) {  # each record
     # print the file description and URL
     print(records[r, 'description'])
     print(records[r, '@renditionBaseUrl'])
   }
 }

If you are familiar with using httr, you can use the National Archives Catalog API to interact with their server. From my reading of that site, there is a way to query the data directly, so you would not have to scrape the web page.

I played around in the sandbox without an API key and got this far in translating your web page request into an API request:

 https://catalog.archives.gov/api/v1?&q=*:*&resultTypes=fileUnit&parentNaId=2456161 

Unfortunately, this does not recognize the parentNaId field name ... possibly because of missing permissions without an API key. In any case, I don't know R myself, so you will need to work out how to do all this with httr.
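For what it's worth, a starting point in httr might look like the sketch below, which only constructs the request. The parameter names are the ones from the sandbox attempt above, which the server did not recognize, so they may need to be replaced with a fielded query:

```r
library(httr)

# Assemble the query URL without sending it; the parameter names are
# assumptions carried over from the sandbox attempt above.
url <- modify_url(
  "https://catalog.archives.gov/api/v1",
  query = list(q = "*:*", resultTypes = "fileUnit", parentNaId = "2456161")
)

# resp <- GET(url)                       # requires network access
# str(content(resp, as = "parsed"))
```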

Hope this helps a bit.


If you want to use RSelenium and rvest, you can use this code:

 library(RSelenium)
 library(rvest)

 url <- "https://catalog.archives.gov/search?q=*:*&f.parentNaId=2456161&f.level=fileUnit&sort=naIdSort%20asc&rows=500"
 rD <- rsDriver()
 remDr <- rD[["client"]]
 remDr$navigate(url)
 page <- read_html(remDr$getPageSource()[[1]])

 links <- page %>%
   html_nodes(".row.result .titleResult a") %>%
   html_attr("href")
 links <- gsub("\\?\\&.{1,}", "", links)
 links <- paste0("https://catalog.archives.gov", links)

 files <- NULL
 names <- NULL
 for (link in links) {
   remDr$navigate(link)
   Sys.sleep(3)
   page <- read_html(remDr$getPageSource()[[1]])
   file <- page %>%
     html_nodes(".uer-list.documents .uer-row1 a") %>%
     html_attr("href")
   name <- page %>%
     html_nodes(".uer-list.documents .uer-row1 a span") %>%
     html_text()
   ind <- which(regexpr("Technical", name) != -1)
   file <- file[ind]
   name <- name[ind]
   files <- c(files, file)
   names <- c(names, name)  # collect the names, not the URLs
   Sys.sleep(1)
 }

Hope this works. I am using Windows 10 x64.

Gottavianoni


Source: https://habr.com/ru/post/1274967/

