Reading text and image locations (xy coordinates) using PDFBox

Question

I am writing a Java program that reads encrypted PDF files and extracts their content page by page, including the text, the images, and their positions (x, y coordinates) on the page. I currently use PDFBox for this and can extract the text and the images, but I cannot get the position of either. I also have problems reading some encrypted PDF files.

java pdfbox

Suresh somanathan Sep 28 '11 at 9:47

1 answer

Pierre d · Answer 1 · 2012-10-12T18:25:29+0000

Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I have used it a bit, and it is very helpful for analyzing the layout of elements and their bounding boxes in PDF documents. It also revealed items printed in white ink or placed outside the printable area (presumably document watermarks, or "forgotten" items the author pushed out of sight).

Usage example:

 java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt

You will get something like this:

 Processing page: 0
 String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
 String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
 String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
 ...

You can easily parse this output and use it to plot, for each page, the position of every element, its bounding box, and the "flow" (the trajectory through all elements).

As I am sure you already know, a PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e., for a printer or a screen), not a markup language. You could easily create a PDF that prints "Hello world" but jumps around the character positions at random (and, if you so choose, uses glyphs that differ from any ISO character encoding), which makes it very hard to convert to text. There is no concept of "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.
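The parsing step mentioned above could be sketched roughly as follows. This is only an illustration, not part of PDFBox: the class `TextLocationParser`, its `Glyph` type, and the `boundingBox` helper are hypothetical names, and the code assumes that each `String[...]` record sits on its own line, that x,y is the glyph origin on the baseline, and that y grows downward (so a glyph spans from y minus the absolute height up to y). Check these assumptions against your own output before relying on the numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): parses PrintTextLocations output.
final class TextLocationParser {

    /** One glyph record; field names mirror the keys in the tool's output. */
    static final class Glyph {
        final double x, y, height, width;
        final String text;

        Glyph(double x, double y, double height, double width, String text) {
            this.x = x;
            this.y = y;
            this.height = height;
            this.width = width;
            this.text = text;
        }
    }

    private static final String NUM = "([-+0-9.eE]+)";

    // Groups: 1=x, 2=y, 3=fs, 4=xscale, 5=height, 6=space, 7=width, 8=text
    private static final Pattern RECORD = Pattern.compile(
            "\\s*String\\[" + NUM + "," + NUM + " fs=" + NUM + " xscale=" + NUM
            + " height=" + NUM + " space=" + NUM + " width=" + NUM + "\\](.*)");

    /** Parses every matching line; other lines (page headers, "...") are skipped. */
    static List<Glyph> parse(List<String> lines) {
        List<Glyph> glyphs = new ArrayList<>();
        for (String line : lines) {
            Matcher m = RECORD.matcher(line);
            if (!m.matches()) {
                continue; // e.g. "Processing page: 0"
            }
            glyphs.add(new Glyph(
                    Double.parseDouble(m.group(1)),
                    Double.parseDouble(m.group(2)),
                    Double.parseDouble(m.group(5)),
                    Double.parseDouble(m.group(7)),
                    m.group(8)));
        }
        return glyphs;
    }

    /** Returns {minX, minY, maxX, maxY}, assuming y grows downward (see lead-in). */
    static double[] boundingBox(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            throw new IllegalArgumentException("no glyphs to measure");
        }
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (Glyph g : glyphs) {
            minX = Math.min(minX, g.x);
            maxX = Math.max(maxX, g.x + g.width);
            // The sample output reports negative heights, so use the magnitude.
            minY = Math.min(minY, g.y - Math.abs(g.height));
            maxY = Math.max(maxY, g.y);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Processing page: 0",
                "String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A",
                "String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f",
                "String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e");
        System.out.println(Arrays.toString(boundingBox(parse(sample))));
    }
}
```

The same glyph list could then be sorted or grouped (for instance by comparing the gap between neighbouring glyphs with the reported `space` width) to approximate words, but any such grouping is a heuristic, for the reasons given above.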

As for the second part of your question, I had good results using xpdf version 3.02 after patching Xref.cc (making XRef::okToPrint(), XRef::okToChange(), XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). This handles documents that are locked (restricted), not encrypted ones (there are other utilities for those).
