I am trying to read the text contents of a PDF file into a Perl variable. From other SO questions and answers, I understand that I need to use CAM::PDF. Here is my code:
    #!/usr/bin/perl -w
    use CAM::PDF;

    my $pdf = CAM::PDF->new('1950-01-01.pdf');
    print $pdf->numPages(), " pages\n\n";

    my $text = $pdf->getPageText(1);
    print $text, "\n";
I tried to run this on this PDF file. Perl does not report any errors. The first print statement works; it prints "2 pages", which is the correct number of pages in this document.
The second print statement, however, does not output anything readable. Here is what the result looks like in Emacs:
    2 pages

    ^A^B^C^D^E^C^F^D^G^H ^D^A^K^L^C^M^D^N^C^M^O^D^P^C^Q^Q^C ^D^R^K^M^O^D ^A^B^C^D^E ^F^G^G^H^E ^K^L ^M^N^E^O^P^E^O^Q^R^S^E
    .... more lines with similar codes ....
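To rule out a display quirk in Emacs, the extracted string can be checked directly for control characters. The following is only a rough sketch: control_char_ratio is a hypothetical helper written for this check (it is not part of CAM::PDF), and $sample is a stand-in for the value returned by getPageText, built from the first few codes shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): returns the fraction of
# characters in a string that are ASCII control characters, ignoring
# tab, newline, and carriage return.
sub control_char_ratio {
    my ($text) = @_;
    return 0 unless defined $text && length $text;
    # List-assignment idiom: count every match of the character class.
    my $controls = () = $text =~ /[\x00-\x08\x0B\x0C\x0E-\x1F]/g;
    return $controls / length $text;
}

# Stand-in for the string returned by $pdf->getPageText(1); these bytes
# correspond to ^A^B^C^D^E^C^F^D^G^H ^D^A^K^L in the Emacs output.
my $sample = "\x01\x02\x03\x04\x05\x03\x06\x04\x07\x08 \x04\x01\x0B\x0C";

printf "%.0f%% control characters\n", 100 * control_char_ratio($sample);
```

A ratio close to 1 would suggest that the string itself holds raw byte codes rather than readable characters, so the problem would lie in the extraction step and not in the editor.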
Is there something I can do to make this work? I don't know much about PDFs, but since I can easily copy and paste text from a PDF using Acrobat, I assumed the content is stored as text rather than as an image, so I was hoping I could extract it with Perl.
Any guidance would be greatly appreciated.