I wrote a library, pdfquery, to try to simplify this process. To extract text from a specific location on a specific page, start by loading the file into a PDFQuery object:
import pdfquery
pdf = pdfquery.PDFQuery(file)  # file: path to the PDF, or an open file object
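A rough sketch of how the remaining steps might look: load the page you care about, then query it with a bounding-box selector. The helper names `bbox_selector` and `extract_text`, the file name, and the coordinates are all made up for illustration; the box is given as "x0,y0,x1,y1" in PDF points measured from the bottom-left corner of the page.

```python
def bbox_selector(page_index, x0, y0, x1, y1):
    """Build a selector for everything fully inside a box on one page.

    page_index is zero-based; coordinates are in PDF points from the
    bottom-left corner of the page. (Hypothetical helper, not part of pdfquery.)
    """
    return 'LTPage[page_index="%d"] :in_bbox("%s,%s,%s,%s")' % (
        page_index, x0, y0, x1, y1)


def extract_text(path, page_index, bbox):
    """Return the text found inside bbox on the given page of the PDF at path."""
    # Imported here so the selector helper above works without the library.
    import pdfquery  # third-party: pip install pdfquery

    pdf = pdfquery.PDFQuery(path)
    pdf.load(page_index)  # parse only the page we need
    return pdf.pq(bbox_selector(page_index, *bbox)).text()


# Example call (assumes a file named "example.pdf" exists):
# text = extract_text("example.pdf", 0, (100, 100, 200, 200))
```

Loading only the pages you need keeps parsing time down on large documents.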
If you want to find individual characters inside that box, instead of text lines that lie entirely inside it, pass merge_tags=None to PDFQuery (by default it merges consecutive characters into a single element to keep the tree manageable, so the whole line has to be inside the box to match). If you want to find anything that partially overlaps the box, use :overlaps_bbox instead of :in_bbox.
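The two variations could be sketched like this. The helper names (`box_query`, `chars_in_box`, `lines_touching_box`) and the default page number are assumptions for the example, and restricting the overlap query to `LTTextLineHorizontal` is my choice here so that the page element itself, which overlaps every box, is not matched.

```python
def box_query(mode, bbox):
    """Build a ':in_bbox(...)' or ':overlaps_bbox(...)' pseudo-class string.

    bbox is (x0, y0, x1, y1) in PDF points. (Hypothetical helper.)
    """
    if mode not in ("in_bbox", "overlaps_bbox"):
        raise ValueError("mode must be 'in_bbox' or 'overlaps_bbox'")
    return ':%s("%s")' % (mode, ",".join(str(v) for v in bbox))


def chars_in_box(path, bbox, page_number=0):
    """Text of the individual characters that lie inside bbox."""
    import pdfquery  # third-party: pip install pdfquery

    # merge_tags=None keeps every character as its own LTChar element.
    pdf = pdfquery.PDFQuery(path, merge_tags=None)
    pdf.load(page_number)
    return pdf.pq("LTChar" + box_query("in_bbox", bbox)).text()


def lines_touching_box(path, bbox, page_number=0):
    """Text of the lines that at least partially overlap bbox."""
    import pdfquery

    pdf = pdfquery.PDFQuery(path)  # default merging: whole lines as elements
    pdf.load(page_number)
    return pdf.pq("LTTextLineHorizontal" + box_query("overlaps_bbox", bbox)).text()
```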
This is basically PyQuery selector syntax for grabbing text from a PDFMiner layout, so if your document is too messy for PDFMiner, it may be too messy for this too, but at least it will be faster to play around with.