Extract text from PDF using Poppler (C++)

Question


I am trying to find my way through Poppler and its (lack of) documentation.

What I want to do is very simple: open the PDF file and read the text in it. Then I will process the text, but that doesn't really matter.

So... I saw the poppler_page_get_text function, which seems to work, but I have to specify a selection rectangle, which is not very convenient. Isn't there a simple function that just outputs the PDF text in order (maybe line by line)?


c++ pdf text-extraction poppler

nico Apr 28 '10 at 18:31


2 answers

For the record, I'm currently using Poppler with this little program:

    #include <iostream>
    #include <memory>
    #include "poppler-document.h"
    #include "poppler-page.h"
    using namespace std;

    int main() {
        // load_from_file returns a null pointer if the file cannot be opened
        unique_ptr<poppler::document> doc(
            poppler::document::load_from_file("./CMI2APIDocV1.4.pdf"));
        if (!doc) {
            cerr << "could not open the PDF" << endl;
            return 1;
        }
        const int pagesNbr = doc->pages();
        cout << "page count: " << pagesNbr << endl;
        for (int i = 0; i < pagesNbr; ++i) {
            // create_page returns an owning pointer, so release it after use
            unique_ptr<poppler::page> page(doc->create_page(i));
            if (page)
                cout << page->text().to_latin1() << endl;
        }
    }
    // g++ -I/usr/include/poppler/cpp/ -c poppler.cpp
    // g++ -I/usr/include/poppler/cpp poppler.o /usr/lib/x86_64-linux-gnu/libpoppler-cpp.a /usr/lib/x86_64-linux-gnu/libpoppler.a /usr/lib/x86_64-linux-gnu/liblcms2.so /usr/lib/x86_64-linux-gnu/libfontconfig.a /usr/lib/x86_64-linux-gnu/libjpeg.a /usr/lib/x86_64-linux-gnu/libfreetype.a /usr/lib/x86_64-linux-gnu/libexpat.a /usr/lib/x86_64-linux-gnu/libz.a

I am quite pleased with the result, except for how tables and spreadsheet-like layouts come out as plain text, where a single cell can sometimes span several lines. (Does anyone know how to avoid this?)


yves Baumes Nov 04 '13 at 9:36


plinth · Accepted Answer · 2010-04-29 19:13

You should be able to set the selection rectangle to the page's pageSize/MediaBox and get all the text.

I say "should" because, before you start wondering why the output of poppler_page_get_text surprises you, you need to know how text gets laid out on a page. All graphics are placed on the page by a program expressed in postfix notation. To render a page, this program is executed against a blank page.

Operations in the program can include changing colors, position, or the current transformation matrix, drawing lines and Bezier curves, and so on. Text is laid out by a series of text operators, which are always bracketed by BT (begin text) and ET (end text). How and where text is placed on the page is at the sole discretion of the software that generates the PDF. For print drivers, for example, the code responds to GDI calls to DrawString and translates them into text drawing operations.
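To make the postfix idea concrete, here is a minimal sketch that scans an uncompressed content stream and collects the strings shown with the Tj operator between BT and ET. The function name shown_strings is hypothetical and not part of Poppler, and the parsing is heavily simplified: no TJ arrays, no hex strings, no font decoding, and backslash escapes are reduced to the escaped character rather than decoded.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper, not a Poppler API: returns the operands of every Tj
// operator that appears between BT and ET in an uncompressed content stream.
std::vector<std::string> shown_strings(const std::string& stream) {
    std::vector<std::string> out;
    std::string last_string;   // most recent string operand
    bool have_string = false;  // true if the previous token was a string
    bool in_text = false;      // true between BT and ET
    std::size_t i = 0;
    while (i < stream.size()) {
        unsigned char c = static_cast<unsigned char>(stream[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (c == '(') {
            // Literal string: parentheses may nest, a backslash escapes the
            // next character (simplification: "\n" yields 'n', not a newline).
            std::string s;
            int depth = 1;
            ++i;
            while (i < stream.size() && depth > 0) {
                char ch = stream[i++];
                if (ch == '\\' && i < stream.size()) { s += stream[i++]; }
                else if (ch == '(') { ++depth; s += ch; }
                else if (ch == ')') { if (--depth > 0) s += ch; }
                else { s += ch; }
            }
            last_string = s;
            have_string = true;
            continue;
        }
        // Any other token (number, name, operator) runs up to whitespace or '('.
        std::string tok;
        while (i < stream.size() &&
               !std::isspace(static_cast<unsigned char>(stream[i])) &&
               stream[i] != '(') {
            tok += stream[i++];
        }
        if (tok == "BT") in_text = true;
        else if (tok == "ET") in_text = false;
        else if (tok == "Tj" && in_text && have_string) out.push_back(last_string);
        // Postfix: the operand comes first, so a string only counts if it
        // immediately precedes the operator.
        have_string = false;
    }
    return out;
}
```

Calling shown_strings on "BT /F1 12 Tf 72 720 Td (Hello) Tj ET" yields the single string "Hello". Note that the result comes back in stream order, which is exactly why extracted text need not be in reading order.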

If you're lucky, the text on the page is laid out in a reasonable manner using a reasonable font, but many programs that generate PDFs are not so kind. Psroff, for example, liked to place all the plain text first, then the italic text, then the bold text. Words may or may not be placed in reading order. Fonts can be re-encoded so that 'a' maps to '{' or something else. You may also have ligatures, where several characters are replaced by a single glyph; the most common are ae, oe, fi, fl and ffl.
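As a rough illustration of the kind of cleanup an extractor has to do, the sketch below reorders positioned text fragments into reading order and expands the f-ligatures. Everything in it is an assumption made for the example: the Fragment type, the function names, and the 2-unit line tolerance are hypothetical, and it only handles a single column of horizontal text with the ligatures encoded as Unicode U+FB00 to U+FB04 in UTF-8.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical fragment type: a run of text and where its baseline starts.
struct Fragment {
    double x;
    double y;          // PDF user space: y grows upward
    std::string text;  // UTF-8
};

// Replace the ligatures U+FB00..U+FB04 (ff, fi, fl, ffi, ffl) with plain letters.
std::string expand_ligatures(const std::string& utf8) {
    static const char* const letters[] = {"ff", "fi", "fl", "ffi", "ffl"};
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        // U+FB00..U+FB04 are the UTF-8 byte sequences EF AC 80 .. EF AC 84.
        if (i + 2 < utf8.size() &&
            static_cast<unsigned char>(utf8[i]) == 0xEF &&
            static_cast<unsigned char>(utf8[i + 1]) == 0xAC) {
            unsigned char b2 = static_cast<unsigned char>(utf8[i + 2]);
            if (b2 >= 0x80 && b2 <= 0x84) {
                out += letters[b2 - 0x80];
                i += 3;
                continue;
            }
        }
        out += utf8[i++];
    }
    return out;
}

// Sort fragments top to bottom, group those within line_tolerance of the
// first fragment of a line, then order each line left to right.
std::string in_reading_order(std::vector<Fragment> frags,
                             double line_tolerance = 2.0) {
    std::stable_sort(frags.begin(), frags.end(),
                     [](const Fragment& a, const Fragment& b) { return a.y > b.y; });
    std::string out;
    std::size_t start = 0;
    while (start < frags.size()) {
        std::size_t end = start + 1;
        while (end < frags.size() &&
               frags[start].y - frags[end].y <= line_tolerance) {
            ++end;
        }
        std::stable_sort(frags.begin() + start, frags.begin() + end,
                         [](const Fragment& a, const Fragment& b) { return a.x < b.x; });
        for (std::size_t k = start; k < end; ++k) {
            if (k > start) out += ' ';
            out += expand_ligatures(frags[k].text);
        }
        out += '\n';
        start = end;
    }
    return out;
}
```

Grouping by a y tolerance is a deliberately naive choice: it breaks on multi-column pages, rotated text, and tables, which is part of why real extractors produce disappointing output on such layouts.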

Given all this, extracting text is definitely not trivial, so do not be surprised if the results are of poor quality.

I used to work on the text extraction tools in Acrobat 1.0 and 2.0; it is a real challenge to get right.
