Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I have only used it a little, but it is very useful for analyzing the placement of elements and their bounding boxes in PDF documents. It also revealed elements printed in white ink or outside the printable area (presumably document watermarks, or "forgotten" items left outside the author's field of view).
Usage example:
java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt
You will get something like this:
Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...
You can easily parse this output and use it to reconstruct, for each page, the position of every element, its bounding box, the "flow" (the trajectory through all the elements), and so on.
As I am sure you already know, a PDF file can be almost impossible to convert to text. It is really just a graphic description format (i.e., for a printer or screen), not a markup language. You can easily create a PDF that prints "Hello world" but jumps around between character positions at random (and, if you choose, uses glyphs that do not match any ISO character encoding), which makes it very hard to convert to text. There is no concept of a "word" or a "paragraph". A two-column document, for example, can be a nightmare to parse as text.
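To illustrate the parsing step mentioned above, here is a minimal sketch in plain Java that reads entries in the format shown and computes an overall bounding box. The class and method names (TextLocationParser, Glyph, parse, boundingBox, text) are made up for this example and are not part of PDFBox. It assumes one entry per line, that y is the baseline in a top-left-origin coordinate system, and that the absolute value of height is the glyph height; check those assumptions against your own documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
class TextLocationParser {

    /** One entry from the PrintTextLocations output. */
    static final class Glyph {
        final double x, y, height, width;
        final String text;

        Glyph(double x, double y, double height, double width, String text) {
            this.x = x;
            this.y = y;
            this.height = height;
            this.width = width;
            this.text = text;
        }
    }

    // A decimal number, optionally signed and optionally with an exponent.
    private static final String NUM = "(-?[0-9]+(?:\\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)";

    // Matches: String[x,y fs=.. xscale=.. height=.. space=.. width=..]text
    private static final Pattern ENTRY = Pattern.compile(
            "String\\[" + NUM + "," + NUM + " fs=" + NUM + " xscale=" + NUM
            + " height=" + NUM + " space=" + NUM + " width=" + NUM + "\\](.*)");

    /** Parses the lines that look like entries and silently skips the rest. */
    static List<Glyph> parse(List<String> lines) {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (m.matches()) {
                glyphs.add(new Glyph(
                        Double.parseDouble(m.group(1)),   // x
                        Double.parseDouble(m.group(2)),   // y
                        Double.parseDouble(m.group(5)),   // height
                        Double.parseDouble(m.group(7)),   // width
                        m.group(8)));                     // glyph text
            }
        }
        return glyphs;
    }

    /**
     * Returns {minX, minY, maxX, maxY}.
     * Assumption: each glyph spans from y - |height| to y vertically.
     */
    static double[] boundingBox(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            throw new IllegalArgumentException("no glyphs");
        }
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (Glyph g : glyphs) {
            minX = Math.min(minX, g.x);
            maxX = Math.max(maxX, g.x + g.width);
            minY = Math.min(minY, g.y - Math.abs(g.height));
            maxY = Math.max(maxY, g.y);
        }
        return new double[] { minX, minY, maxX, maxY };
    }

    /** Concatenates the glyph texts in the order they were printed (the "flow"). */
    static String text(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Processing page: 0",
                "String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A",
                "String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f",
                "String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e");
        List<Glyph> glyphs = parse(sample);
        System.out.println(text(glyphs) + " -> " + Arrays.toString(boundingBox(glyphs)));
    }
}
```

The text comes out in the order the PDF draws it, which is not necessarily reading order. For anything beyond a quick inspection you would probably want to sort or cluster the glyphs by position first.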
As for the second part of your question, I had good results using xpdf version 3.02 after patching XRef.cc (making XRef::okToPrint(), XRef::okToChange(), XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). This is for processing locked (permission-restricted) documents, not encrypted ones (there are other utilities for that).