Extract data from a specific position in a PDF?

I am trying to extract data from a PDF, which can be found at https://www.dol.gov/ui/data.pdf . The data I am interested in is on page 4 of the PDF: 3 observations of initial claims (NSA), 3 observations of insured unemployment (NSA), and the employment figure given in footnote 2.

I read the PDF file into R using pdftools, but the output is pretty ugly (as you would expect, given the nature of PDF files). Is there a way to extract specific data from this output? I believe the data will always be in the same place in the output, which should help.

The output I am looking at can be reproduced with the following script:

library(pdftools)

download.file("https://www.dol.gov/ui/data.pdf", "data.pdf", mode="wb")

uidata <- pdf_text("data.pdf")
uidata[4]

I searched for people with similar questions and experimented with scan() and grep(), but I cannot find a way to isolate and extract the data I need from the text output. Thanks in advance if anyone stumbles across this and can point me in the right direction. If not, I will keep trying to figure it out!

1 answer

With grep and a little regex, you can get everything you need into a usable structure:

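library(magrittr)

# pdf_text() returns one string per page; split page 4 into lines
x <- pdftools::pdf_text('https://www.dol.gov/ui/data.pdf')
x2 <- readLines(textConnection(x[4]))
# each table on the page starts with a 'WEEK ENDING' header line
r <- grep('WEEK ENDING', x2)

l <- lapply(seq_along(r), function(i){
    # take lines from this header up to the line before the next header,
    # or before the 'FOOTNOTE' line for the last table
    x2[r[i]:(na.omit(c(r[i + 1], grep('FOOTNOTE', x2)))[1] - 1)] %>% 
        trimws() %>% 
        gsub('\\s{2,}', ';', .) %>%    # runs of 2+ spaces become ';' delimiters
        paste(collapse = '\n') %>% 
        read.csv2(text = ., dec = '.')
    })

# find the line(s) containing '2.', then drop a leading '2' and all non-digits
from_footnote <- as.numeric(gsub('^2|\\D', '', x2[grep('2\\.', x2)]))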

l[[1]][3,]
#>                      WEEK.ENDING December.17 December.10  Change
#> Initial Claims (NSA)     315,613     305,333     +10,280 352,534
#>                      December.3
#> Initial Claims (NSA)    319,641

from_footnote
#> [1] 138322138
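
The key step above is turning runs of whitespace into a delimiter. Here is a minimal sketch of that step in isolation, assuming the page text pads columns with two or more spaces while labels contain only single spaces. The input lines and the name `delimited` are made up for illustration and are not taken from the PDF:

```r
# Hypothetical lines imitating pdf_text() output; the numbers are made up.
lines <- c(
  "   WEEK ENDING          Week A       Week B  ",
  "Initial Claims (NSA)    100,000      90,000  "
)

# Trim both ends, then turn each run of 2+ whitespace characters into ';'.
# Single spaces inside a label such as "Initial Claims (NSA)" are kept.
delimited <- gsub("\\s{2,}", ";", trimws(lines))

# Read the delimited text as a ';'-separated table with a header row.
tbl <- read.csv2(text = paste(delimited, collapse = "\n"), dec = ".")
```

Values such as "100,000" are still read as text at this point, because the thousands separator prevents numeric conversion.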

You still have to parse the numbers, but at least it's usable.
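
One possible way to do that parsing, as a rough sketch: strip the thousands separators and any leading plus sign before converting. The helper name `parse_num` is hypothetical, and the input strings are copied from the output shown above:

```r
# Hypothetical helper: remove ',' and '+' so as.numeric() can convert the text.
# A leading '-' is left in place, so negative changes keep their sign.
parse_num <- function(x) as.numeric(gsub("[,+]", "", x))

vals <- parse_num(c("315,613", "305,333", "+10,280"))
vals  # 315613, 305333 and 10280 as numeric values
```

Depending on how the columns were read, you may need to wrap them in as.character() first, for example if they came in as factors.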


Source: https://habr.com/ru/post/1665042/