Return text string from physical coordinates in a PDF using Python

Question

I have been struggling with Google and the limited PDFMiner documentation for the last few hours, and although I feel close, I am just not getting what I need. I worked through http://www.unixuser.org/~euske/python/pdfminer/ and watched all three YouTube videos to understand PDF files better, and I can output the raw text just fine.

I am working on a script to parse multiple PDF pages. Unfortunately, for this project I am dealing with low-quality PDF files, and the only reliable constant I can see is the physical layout: the text lines always sit in exactly the same place on the page. Although I have read hints that text strings can be extracted by physical coordinates, I have yet to see a working example.

Can anyone shed light on how this is done with PDFMiner? I am open to other modules if there is an obviously better choice, but I need to stick with Python for the script.

As a side note, I have not tried PyPdf for this (beyond basic text output).

Thanks!

+3

python pdf

user1145643 Feb 18 '12 at 18:11

2 answers

I wrote a library, pdfquery, to try to simplify this process. To extract text from a specific location on a specific page, you can do something like this:

    import pdfquery

    # `file` is the path to (or an open file object for) the PDF
    pdf = pdfquery.PDFQuery(file)

    # load first, third, fourth pages
    pdf.load(0, 2, 3)

    # find text between 100 and 300 points from the bottom-left corner of the first page
    text = pdf.pq('LTPage[page_index="0"] :in_bbox("100,100,300,300")').text()

    # save tree as XML to try to figure out why the last line didn't work the way you expected :)
    pdf.tree.write(filename, pretty_print=True)

If you want to find individual characters inside that box, rather than only text lines that lie entirely inside it, pass merge_tags=None to PDFQuery. By default it merges consecutive characters into a single element to keep the tree less cluttered, so a whole line has to be inside the box to match. If you want to find anything that partially overlaps the box, use :overlaps_bbox instead of :in_bbox.

This is basically PyQuery syntax for grabbing text from a PDFMiner layout, so if your document is too messy for PDFMiner, it may be too messy for pdfquery as well, but at least it will let you experiment faster.
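To make the difference between the two selectors concrete, here is a minimal pure-Python sketch of the two geometric tests. It is an illustration only, not pdfquery's actual implementation: the names Box, inside and overlaps are hypothetical, boxes follow the PDF convention of (x0, y0, x1, y1) measured in points from the bottom-left corner, and treating boxes that merely touch at an edge as overlapping is an assumption.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box in PDF points, origin at bottom-left."""
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def inside(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely within `outer` (the :in_bbox idea)."""
    return (inner.x0 >= outer.x0 and inner.y0 >= outer.y0
            and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any area or edge (the :overlaps_bbox idea)."""
    return a.x0 <= b.x1 and a.x1 >= b.x0 and a.y0 <= b.y1 and a.y1 >= b.y0


# Query region and three made-up text-line boxes for illustration.
query = Box(100, 100, 300, 300)
line_inside = Box(120, 150, 280, 162)      # fully inside the region
line_straddling = Box(250, 200, 400, 212)  # sticks out on the right
line_outside = Box(320, 400, 500, 412)     # nowhere near the region

for name, box in [("inside", line_inside),
                  ("straddling", line_straddling),
                  ("outside", line_outside)]:
    print(name, inside(box, query), overlaps(box, query))
```

A line that sticks out of the region fails the containment test but passes the overlap test, which is why a slightly misplaced line on a low-quality scan may only match with the overlap variant.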

+4

Jack Cushman Apr 16 '12 at 21:13

alexis · Accepted Answer · 2012-02-18T18:51:38+0000

I managed to find my way through pdfminer thanks to some code by Denis Papathanasiou. The code is discussed on his blog, and you can find the source here: layout_scanner.py

In particular, look at the parse_lt_objs() method. In its final loop, k should be a pair containing the coordinates of that bit of text (which the code then discards). I don't have working coordinate-based code to post here (I didn't need it myself), but it looks like you can easily find your way from there.
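For reference, the same idea of walking the layout objects and keeping their coordinates can be sketched as follows. This is not the code from layout_scanner.py, and it assumes the maintained pdfminer.six fork (extract_pages, LTTextBox, LTTextLine) rather than the 2012 PDFMiner API; the function names bbox_inside and text_in_region are hypothetical.

```python
def bbox_inside(inner, outer):
    """True if bbox `inner` (x0, y0, x1, y1) lies entirely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def text_in_region(pdf_path, page_index, region):
    """Return the text lines on one page whose bbox falls inside `region`.

    `region` is (x0, y0, x1, y1) in points from the bottom-left corner.
    Assumes pdfminer.six is installed; imported here so the helper above
    stays usable without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox, LTTextLine

    found = []
    # page_numbers is zero-based in pdfminer.six
    for page_layout in extract_pages(pdf_path, page_numbers=[page_index]):
        for element in page_layout:
            if not isinstance(element, LTTextBox):
                continue  # skip figures, rules, images, etc.
            for line in element:
                if isinstance(line, LTTextLine) and bbox_inside(line.bbox, region):
                    found.append(line.get_text().strip())
    return found
```

Calling text_in_region("scan.pdf", 0, (100, 100, 300, 300)), with a placeholder file name, would collect the lines that sit fully inside that square on the first page. If lines shift slightly between files, widening the region is probably simpler than switching to an overlap test.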

Good luck to you!
