Error trying to read PDF file using readPDF from tm package

(Windows 7 / R version 3.0.1)

Below are the commands and the resulting error:

    > library(tm)
    > pdf <- readPDF(PdftotextOptions = "-layout")
    > dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpS8Uql1\pdfinfo167c2bc159f8': No such file or directory

How can I solve this problem?




EDIT I

(As Ben suggested and as described here.)

I downloaded Xpdf and copied the 32-bit version to C:\Program Files (x86)\xpdf32 and the 64-bit version to C:\Program Files\xpdf64.

The environment variables pdfinfo and pdftotext point to the corresponding executables, either 32-bit (tested with 32-bit R) or 64-bit (tested with 64-bit R).
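One way to check whether the running R session can actually locate the two executables is Sys.which(), which searches the PATH of that session. This is only a minimal diagnostic sketch: the variable name found is made up for illustration, and an empty string in the result would suggest that the corresponding tool is not reachable by name from R.

```r
# Look up the two Xpdf tools on the PATH of the current R session.
# Sys.which() returns the full path for each name, or "" if it is not found.
found <- Sys.which(c("pdfinfo", "pdftotext"))
print(found)

# Names that could not be resolved (empty result means both were found).
missing_tools <- names(found)[found == ""]
print(missing_tools)
```

If either name shows up in missing_tools, a call by bare name such as system2("pdfinfo", ...) would probably fail in that session as well.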




EDIT II

One very confusing observation: in a fresh session (tm not loaded), the last command on its own already throws the error:

    > dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpKi5GnL\pdfinfode8283c422f': No such file or directory

I do not understand this at all, because at that point the variable pdf has not yet been assigned by tm's readPDF. Below you can see what pdf refers to "naturally" in the fresh session, followed by what readPDF returns:

    > pdf
    function (elem, language, id)
    {
        meta <- tm:::pdfinfo(elem$uri)
        content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
            "-"), stdout = TRUE)
        PlainTextDocument(content, meta$Author, meta$CreationDate,
            meta$Subject, meta$Title, id, meta$Creator, language)
    }
    <environment: 0x0674bd8c>
    > library(tm)
    > pdf <- readPDF(PdftotextOptions = "-layout")
    > pdf
    function (elem, language, id)
    {
        meta <- tm:::pdfinfo(elem$uri)
        content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
            "-"), stdout = TRUE)
        PlainTextDocument(content, meta$Author, meta$CreationDate,
            meta$Subject, meta$Title, id, meta$Creator, language)
    }
    <environment: 0x0c3d7364>

There seems to be no difference - then why use readPDF at all?
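One possible reading, sketched here under assumptions: the printed function uses PdftotextOptions without taking it as an argument, so the value has to come from the enclosing environment, which is what the differing <environment: ...> lines hint at. The example below imitates that pattern in base R only. make_reader is a hypothetical stand-in, not tm's actual implementation, and the option strings are just illustrative values.

```r
# Hypothetical function factory: returns a reader that remembers `options`
# in its enclosing environment instead of taking it as an argument.
make_reader <- function(options) {
  function(uri) {
    # Build the argument vector that would be passed to pdftotext.
    c(options, shQuote(uri), "-")
  }
}

reader_layout <- make_reader("-layout")
reader_raw    <- make_reader("-raw")

# The code of both readers is the same ...
same_body <- identical(body(reader_layout), body(reader_raw))

# ... but each carries its own environment holding a different `options`.
opt_layout <- get("options", envir = environment(reader_layout))
opt_raw    <- get("options", envir = environment(reader_raw))
```

Under this reading, two readers can print identically and still behave differently, because the difference is stored in the environment rather than in the visible code.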




EDIT III

The PDF file is located in C:\Users\Raffael\Documents, which is also the working directory:

    > getwd()
    [1] "C:/Users/Raffael/Documents"



EDIT IV

The first statement in pdf() is a call to tm:::pdfinfo(), and the error already occurs within the first few lines of that function:

    > outfile <- tempfile("pdfinfo")
    > on.exit(unlink(outfile))
    > status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")),
    +                   stdout = outfile)
    > tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
    +           "Producer", "CreationDate", "ModDate", "Tagged", "Form",
    +           "Pages", "Encrypted", "Page size", "File size", "Optimized",
    +           "PDF version")
    > re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
    +           tags)), collapse = "|"))
    > lines <- readLines(outfile, warn = FALSE)
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d419174450': No such file or direc

Apparently tempfile() just doesn't create the file.

    > outfile <- tempfile("pdfinfo")
    > outfile
    [1] "C:\\Users\\Raffael\\AppData\\Local\\Temp\\RtmpquRYX6\\pdfinfo8d437bd65d9"

The folder C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6 exists and contains some files, but none of them is named pdfinfo8d437bd65d9.
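For what it is worth, tempfile() only generates a unique path and never creates a file itself. In the lines above, the file would have to be written by system2(..., stdout = outfile), so a missing file may rather indicate that the pdfinfo call produced no output, and the value of status could be worth inspecting. The following minimal sketch shows only the tempfile() behaviour; the text written to the file is a made-up placeholder.

```r
# tempfile() returns a path inside the session's temporary directory,
# but it does not create anything on disk.
outfile <- tempfile("pdfinfo")
exists_before <- file.exists(outfile)

# The file only appears once something writes to that path.
writeLines("Title: placeholder", outfile)
exists_after <- file.exists(outfile)

print(c(before = exists_before, after = exists_after))

# Clean up, as on.exit(unlink(outfile)) would do inside a function.
unlink(outfile)
```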

r tm
Jul 31 '13 at 19:20
1 answer

Interestingly, on my machine after a fresh start, pdf is the grDevices function for writing graphics to a PDF file:

    getAnywhere(pdf)
    A single object matching 'pdf' was found
    It was found in the following places
      package:grDevices
      namespace:grDevices
    [etc.]

But back to the problem of reading in PDF files as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work on several different computers), so I think the simplest and safest method is to call pdftotext directly using system, as described here by Tony Breyal.

In your case, it will be (note the two sets of quotes):

    system(paste('"C:/Program Files/xpdf64/pdftotext.exe"',
                 '"C:/Users/Raffael/Documents/17214.pdf"'), wait=FALSE)

This is easily extended with an *apply function or a loop if you have many PDF files.
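A rough sketch of that extension, assuming the executable and folder paths from the command above. build_cmd, pdftotext_exe and pdf_dir are names invented for this example. When no output file is given, pdftotext writes a .txt file next to each PDF.

```r
# Assumed locations, taken from the single-file command above.
pdftotext_exe <- "C:/Program Files/xpdf64/pdftotext.exe"
pdf_dir       <- "C:/Users/Raffael/Documents"

# Hypothetical helper: build one command line, wrapping both paths in
# double quotes so that spaces in the paths survive.
build_cmd <- function(pdf_path, exe = pdftotext_exe) {
  paste(shQuote(exe, type = "cmd"), shQuote(pdf_path, type = "cmd"))
}

# All PDF files in the folder (empty vector if the folder has none).
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

# Convert each file in turn; wait = TRUE finishes one before starting the next.
invisible(lapply(pdf_files, function(f) system(build_cmd(f), wait = TRUE)))
```

The resulting text files could then be read back with readLines() or a similar function.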

Jul 31 '13 at 9:23


