Reading data from PDF files in R

Is it even possible!?!

I have a bunch of legacy reports that I need to import into a database. However, they are all in PDF format. Are there any R packages that can read PDF? Or should I leave this to a command-line tool?

The reports were made in Excel and then converted to PDF, so they have a regular structure, but many empty cells.

+47
linux r pdf pdf-scraping scrape
Feb 07 2018-12-12T00:
5 answers

Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the underlying document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing but OCR can help you.
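A quick way to spot this case before building anything on top of it is to check whether text extraction returns anything at all. The sketch below is only an illustration: `has_text_layer` and its `min_chars` threshold are hypothetical, and `pages` is assumed to be a character vector with one element per page, which is the shape `pdftools::pdf_text()` returns.

```r
# Hypothetical helper: flag pages that appear to contain real text.
# `pages` is a character vector with one element per page.
has_text_layer <- function(pages, min_chars = 20) {
  # count the non-whitespace characters on each page
  n <- nchar(gsub("[[:space:]]", "", pages))
  n >= min_chars
}

# Stand-in for extracted text: page 1 has text, page 2 came back empty
# (as a scanned image would).
pages <- c("Report 2009\nRegion   Sales\nNorth    1200\n", "\n \n")
has_text_layer(pages)
```

Pages flagged `FALSE` are candidates for OCR rather than text extraction. The threshold is arbitrary and would need tuning for real reports.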

In addition, in my sad experience, there is no guarantee that the applications that create PDF documents all behave the same way, so the data in your table may or may not be read out in the desired order (a consequence of the way the document was built). Be careful.

It is probably best to have a couple of grad students transcribe the data for you. They are cheap :-)

+20
Feb 08 2018-12-12T00:

So ... this gets me close, even on a fairly complicated table.

Download the sample PDF from bmi pdf

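    library(tm)
    # readPDF() returns a reader function; "-layout" is passed to pdftotext
    # so that the column layout of the page is preserved
    pdf <- readPDF(PdftotextOptions = "-layout")
    dat <- pdf(elem = list(uri = 'bmi_tbl.pdf'), language = 'en', id = 'id1')
    # turn each run of spaces into a comma, then parse the result as CSV
    dat <- gsub(' +', ',', dat)
    out <- read.csv(textConnection(dat), header = FALSE)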
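Note that turning every run of spaces into a comma also swallows empty cells, so rows with blanks end up with shifted columns. One possible workaround, sketched here under the assumption that the `-layout` text keeps columns vertically aligned, is to treat character positions that are blank on every line as column separators. `parse_layout_table` and the sample lines are hypothetical and not part of tm:

```r
# Hypothetical helper: infer columns from character positions that are
# blank on every line, then cut each line at those boundaries.
parse_layout_table <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]              # drop blank lines
  width <- max(nchar(lines))
  # right-pad every line to the same width
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  # character matrix: one row per line, one column per character position
  chars <- do.call(rbind, strsplit(padded, ""))
  blank <- colSums(chars != " ") == 0                # TRUE where all lines are blank
  runs <- rle(!blank)                                # runs of non-blank positions
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  starts <- starts[runs$values]
  ends <- ends[runs$values]
  # cut every line at each column's start/end and trim the padding
  cells <- mapply(function(s, e) trimws(substring(padded, s, e)), starts, ends)
  as.data.frame(matrix(cells, nrow = length(lines)), stringsAsFactors = FALSE)
}

# Stand-in for layout-preserved text with two empty cells
lines <- sprintf("%-8s%-6s%2s",
                 c("Name", "Alice", "Bob"),
                 c("Q1", "10", ""),
                 c("Q2", "", "7"))
tbl <- parse_layout_table(lines)
print(tbl)
```

Empty cells come back as empty strings instead of disappearing. The approach breaks down when a gap between two columns is not blank on every line (for example, a long value overflowing into it), so the result still needs checking by eye.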
+31
Feb 08 2018-12-12T00:

The current package du jour for extracting text from PDF files is pdftools (the successor to Rpoppler, noted in another answer). It works fine on Linux, Windows, and OSX:

    install.packages("pdftools")
    library(pdftools)
    download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
    txt <- pdf_text("1403.2805.pdf")

    # first page text
    cat(txt[1])

    # second page text
    cat(txt[2])
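Since `pdf_text()` returns one string per page with the lines separated by newlines, a natural next step for tabular reports is to split each page into lines. A minimal sketch, where `page_lines` is a hypothetical helper and `txt` is a stand-in for real `pdf_text()` output:

```r
# Hypothetical helper: split per-page text into non-empty lines.
# `pages` is a character vector with one element per page.
page_lines <- function(pages) {
  lines <- strsplit(pages, "\n", fixed = TRUE)
  # drop lines that are empty or contain only whitespace
  lapply(lines, function(x) x[nzchar(trimws(x))])
}

# Stand-in for pdf_text() output: two pages
txt <- c("Title\n\nA   1\nB   2\n", "Page two\n")
page_lines(txt)[[1]]
```

The result is a list with one character vector per page, which can then be passed to whatever table-parsing step suits the report layout.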
+7
07 Sep '16 at 6:42

You can also (now) use the new (2015-07) Rpoppler package:

    Rpoppler::PDF_text(file)

It includes 3 functions (4, really, but one of them just gets you a pointer to the PDF object):

  • PDF_fonts: PDF font information
  • PDF_info: PDF document information
  • PDF_text: extract PDF text

(Posting as an answer to help new searchers find the package.)

+6
Oct 20 '15 at 11:26

For zx8754: on Win7 this works with pdftotext.exe in the working directory:

    library(tm)
    uri = 'bmi_tbl.pdf'
    pdf = readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                    language = "en", id = "id1")
+3
Jun 26 '15 at 13:18


