Problem reading text from pdf in Perl

Question


I am trying to read the text contents of a PDF file into a Perl variable. From other SO questions and answers, I understand that I need to use CAM::PDF. Here is my code:

    #!/usr/bin/perl -w
    use CAM::PDF;

    my $pdf = CAM::PDF->new('1950-01-01.pdf');
    print $pdf->numPages(), " pages\n\n";

    my $text = $pdf->getPageText(1);
    print $text, "\n";

I tried to run this on this PDF file. Perl does not report any errors. The first print statement works: it prints "2 pages", which is the correct number of pages in this document.

The second print statement does not print anything readable. Here's what the result looks like in Emacs:

    2 pages

    ^A^B^C^D^E^C^F^D^G^H ^D^A^K^L^C^M^D^N^C^M^O^D^P^C^Q^Q^C ^D^R^K^M^O^D ^A^B^C^D^E ^F^G^G^H^E ^K^L ^M^N^E^O^P^E^O^Q^R^S^E
    .... more lines with similar codes ....

Is there something I can do to make this work? I don't know much about PDFs, but I thought that since I can easily copy and paste text from a PDF using Acrobat, it should be recognized as text, not an image, so I was hoping that means I can extract it with Perl.

Any guidance would be greatly appreciated.

+4

perl pdf

itzy Dec 23 '11 at 1:58


2 answers

I am sure that the problem is not in your Perl code but in the PDF file. I ran the same script on one of my own PDF files and it worked fine.

+2

asf107 Dec 29 '11 at 18:11


theglauber · Accepted Answer · 2011-12-29T19:27:07+0000

PDF files can hold different types of content. A PDF may contain no readable text at all, only bitmap images and other graphic content. The PDF you linked to contains compressed data. Open it with a text editor and you will see that the content is in "/Filter /FlateDecode" blocks. Perhaps CAM::PDF does not support this. Google FlateDecode for a few ideas.
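If you would rather not eyeball the file in a text editor, here is a rough sketch that counts the filter names declared in a PDF's raw bytes. It uses core Perl only. The function name `count_filters` and the script name in the comment are made up for illustration, and the regex is a crude scan rather than a real PDF parser, so it will miss anything stored inside compressed object streams.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the stream filter names declared in raw PDF bytes.
# Handles both "/Filter /FlateDecode" and the array form
# "/Filter [/ASCII85Decode /FlateDecode]".
# This is a plain regex scan, not a PDF parser (an assumption of this sketch).
sub count_filters {
    my ($bytes) = @_;
    my %count;
    while ($bytes =~ m{/Filter\s*(\[[^\]]*\]|/\w+)}g) {
        my $spec = $1;
        # Pull every /Name out of the matched filter specification.
        $count{$1}++ while $spec =~ m{/(\w+)}g;
    }
    return \%count;
}

# Hypothetical usage: perl list_filters.pl 1950-01-01.pdf
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;                       # slurp the whole file at once
    my $bytes = <$fh>;
    close $fh;

    my $counts = count_filters($bytes);
    printf "%-20s %d\n", $_, $counts->{$_} for sort keys %$counts;
}
```

If FlateDecode shows up in the list, the page content is zlib-compressed and has to be inflated before any text can be pulled out of it.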

Looking at this PDF file more closely, I see that it also uses embedded font subsets with custom encodings. Even if CAM::PDF handles the compression, the custom encoding may be what throws it off. This may help: A web page from a software company that describes the problem.
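One quick way to check for subset fonts, sketched below under some assumptions: in a PDF, a subset font's BaseFont name normally carries a tag of six uppercase letters followed by "+", for example /BaseFont /ABCDEF+Times-Roman. The function name `subset_fonts` and the script name are hypothetical, and this is again a raw byte scan in core Perl, so font dictionaries hidden inside compressed object streams will not be found.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (as an array ref) the distinct subset font names in raw PDF bytes.
# A subset font is recognised by its six-uppercase-letter tag plus "+",
# e.g. /BaseFont /ABCDEF+Times-Roman. Plain regex scan, not a PDF parser.
sub subset_fonts {
    my ($bytes) = @_;
    my %seen;
    while ($bytes =~ m{/BaseFont\s*/([A-Z]{6}\+[^\s/<>\[\]()]+)}g) {
        $seen{$1} = 1;
    }
    return [ sort keys %seen ];
}

# Hypothetical usage: perl list_subset_fonts.pl 1950-01-01.pdf
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;                       # slurp the whole file at once
    my $bytes = <$fh>;
    close $fh;

    print "$_\n" for @{ subset_fonts($bytes) };
}
```

A subset font often maps character codes to glyphs in its own private order, so the codes that come out of the page text need not correspond to ASCII. That would be consistent with the control characters shown in the question, though I have not confirmed that this is the cause for this particular file.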
