(Windows 7 / R version 3.0.1)
Below are the commands and the resulting error:
    > library(tm)
    > pdf <- readPDF(PdftotextOptions = "-layout")
    > dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpS8Uql1\pdfinfo167c2bc159f8': No such file or directory

How can I solve this problem?
EDIT I
(As Ben suggested and described here)
I downloaded Xpdf and copied the 32-bit version to C:\Program Files (x86)\xpdf32 and the 64-bit version to C:\Program Files\xpdf64.
The environment variables pdfinfo and pdftotext point to the corresponding executables, either 32-bit (tested with 32-bit R) or 64-bit (tested with 64-bit R).
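One way to check from within R whether the executables are actually reachable is Sys.which(), which returns the full path of a command found on the search path, or an empty string if it is not found. This is only a sketch: it assumes that system2() locates pdfinfo and pdftotext through PATH, which is my reading of the documentation and not something I have confirmed.

```r
# Look up the two Xpdf tools the way a shell would, via the PATH.
tools <- Sys.which(c("pdfinfo", "pdftotext"))
print(tools)

# An empty string means R cannot find that executable (assumption: this is
# also how system2() would try to locate it).
missing_tools <- names(tools)[tools == ""]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}

# The search path as R itself sees it, one entry per line.
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
writeLines(path_entries)
```

If either entry comes back empty, the Xpdf folder is probably not part of the PATH that the R session sees, regardless of any other variables that were set.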
EDIT II
One very confusing observation is that, starting from a fresh session (tm not loaded), the last command alone already throws the error:

    > dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpKi5GnL\pdfinfode8283c422f': No such file or directory

I do not understand this, because at that point the variable pdf has not yet been assigned the result of tm's readPDF. Below is the function that pdf refers to "naturally" (before tm is loaded), followed by the one returned by readPDF:

    > pdf
    function (elem, language, id)
    {
        meta <- tm:::pdfinfo(elem$uri)
        content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
            "-"), stdout = TRUE)
        PlainTextDocument(content, meta$Author, meta$CreationDate,
            meta$Subject, meta$Title, id, meta$Creator, language)
    }
    <environment: 0x0674bd8c>
    > library(tm)
    > pdf <- readPDF(PdftotextOptions = "-layout")
    > pdf
    function (elem, language, id)
    {
        meta <- tm:::pdfinfo(elem$uri)
        content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
            "-"), stdout = TRUE)
        PlainTextDocument(content, meta$Author, meta$CreationDate,
            meta$Subject, meta$Title, id, meta$Creator, language)
    }
    <environment: 0x0c3d7364>
There seems to be no difference - then why use readPDF at all?
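The only visible difference between the two printouts is the environment address, so my guess is that readPDF() returns a closure: the function body is the same, while the options are stored in the enclosing environment. The sketch below illustrates that idea in base R only. make_reader is a hypothetical stand-in that I made up for illustration; it is not tm's actual implementation.

```r
# Hypothetical factory, loosely modelled on how readPDF() appears to work:
# it returns a function that remembers the options it was created with.
make_reader <- function(opts) {
  force(opts)  # evaluate the argument now so it is stored in the closure
  function(uri) paste("pdftotext", opts, shQuote(uri), "-")
}

reader_layout <- make_reader("-layout")
reader_raw <- make_reader("-raw")

# The printed bodies are the same ...
same_body <- identical(deparse(body(reader_layout)), deparse(body(reader_raw)))
print(same_body)

# ... but each closure carries its own captured options.
print(environment(reader_layout)$opts)
print(environment(reader_raw)$opts)
```

Applied to the case above, environment(pdf)$PdftotextOptions should show which options a given pdf function would pass to pdftotext, assuming it really is such a closure.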
EDIT III
The pdf file is here: C:\Users\Raffael\Documents
    > getwd()
    [1] "C:/Users/Raffael/Documents"
EDIT IV
The first instruction in pdf() is a call to tm:::pdfinfo(), and the error occurs within the first few lines of that function:

    > outfile <- tempfile("pdfinfo")
    > on.exit(unlink(outfile))
    > status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")),
    +     stdout = outfile)
    > tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
    +     "Producer", "CreationDate", "ModDate", "Tagged", "Form",
    +     "Pages", "Encrypted", "Page size", "File size", "Optimized",
    +     "PDF version")
    > re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
    +     tags)), collapse = "|"))
    > lines <- readLines(outfile, warn = FALSE)
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d419174450': No such file or direc
Apparently tempfile() just doesn't create the file.
    > outfile <- tempfile("pdfinfo")
    > outfile
    [1] "C:\\Users\\Raffael\\AppData\\Local\\Temp\\RtmpquRYX6\\pdfinfo8d437bd65d9"

The folder C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6 exists and contains some files, but none of them is named pdfinfo8d437bd65d9.
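To narrow this down, here is a small sketch of how I understand these pieces to interact. As far as I can tell from the documentation, tempfile() only returns a unique file name and never creates the file itself, so the file would have to be written by the command that system2() runs. The command name no-such-command-xyz is a made-up placeholder standing in for an executable that R cannot find.

```r
# tempfile() only builds a unique path; nothing is written to disk yet.
outfile <- tempfile("pdfinfo")
exists_before <- file.exists(outfile)
print(exists_before)

# Hypothetical command name that should not exist on any system. system2()
# returns a status code when stdout is redirected to a file; a non-zero
# value indicates that the command failed or could not be run at all.
status <- suppressWarnings(
  system2("no-such-command-xyz", "--version", stdout = outfile, stderr = FALSE)
)
print(status)

# Writing to the path ourselves is what actually creates the file.
writeLines("Title:          example", outfile)
exists_after <- file.exists(outfile)
print(exists_after)

unlink(outfile)  # clean up the temporary file
```

If the status returned by the pdfinfo call in the transcript above is non-zero, that would suggest the executable never ran, which would also explain why readLines() finds no file to open.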