It sounds like you found a solution, but if you ever want to do it without a web service, or you need to extract data based on its exact location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be written out as XML or queried with XPath, PyQuery, or whatever else you want to use.
To use it, once you have saved the file to disk, you would create pdf = pdfquery.PDFQuery(name_pdf) and call pdf.load(), or pass the urllib file object in directly if you do not need to save it. To get XML out to parse with BeautifulSoup, you can serialize the tree with lxml, for example lxml.etree.tostring(pdf.tree).
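Put together, the load-and-serialize steps might look roughly like the sketch below. The helper name pdf_to_xml is just something made up for illustration, and I'm assuming pdfquery and lxml are both installed.

```python
def pdf_to_xml(pdf_path):
    """Return the XML serialization of a PDF as a string (a sketch)."""
    # Third-party imports live inside the function so this snippet can be
    # loaded even on a machine where pdfquery and lxml are not installed.
    import pdfquery
    from lxml import etree

    pdf = pdfquery.PDFQuery(pdf_path)  # a path on disk or an open file object
    pdf.load()                         # parse the pages and build pdf.tree
    # pdf.tree is an lxml ElementTree, so lxml's own serializer applies.
    return etree.tostring(pdf.tree, pretty_print=True, encoding="unicode")
```

The returned string can then be handed to BeautifulSoup or any other XML parser you prefer.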
If you don't mind using jQuery-style selectors, there is a PyQuery interface with positional extensions, which can be very convenient. For example:
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label="23"] :in_bbox("0, 0, 600, 200")')]
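If it helps to see what a positional selector like :in_bbox is doing, here is a rough standard-library sketch of the same idea. The XML fragment is hand-written for illustration (the element names and the x0/y0/x1/y1 attributes are assumptions modeled on the kind of output described above), and texts_in_bbox is a hypothetical helper, not part of pdfquery. It keeps only elements that lie entirely inside the box, with coordinates measured from the bottom-left corner of the page.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the XML tree; names and attributes are
# assumptions for illustration, not real pdfquery output.
SAMPLE_XML = """
<pdfxml>
  <LTPage page_label="23">
    <LTTextLineHorizontal x0="50" y0="700" x1="300" y1="712">Account summary</LTTextLineHorizontal>
    <LTTextLineHorizontal x0="50" y0="150" x1="300" y1="162">Your balance is $12.34</LTTextLineHorizontal>
    <LTTextLineHorizontal x0="50" y0="40" x1="200" y1="52">Page 23</LTTextLineHorizontal>
  </LTPage>
</pdfxml>
"""


def texts_in_bbox(root, page_label, x0, y0, x1, y1):
    """Return the text of elements lying entirely inside the given box."""
    found = []
    for page in root.iter("LTPage"):
        if page.get("page_label") != page_label:
            continue
        for el in page.iter():
            # Skip the page element itself and anything without coordinates.
            if el is page or el.get("x0") is None:
                continue
            ex0, ey0, ex1, ey1 = (
                float(el.get(key)) for key in ("x0", "y0", "x1", "y1")
            )
            if ex0 >= x0 and ey0 >= y0 and ex1 <= x1 and ey1 <= y1:
                found.append((el.text or "").strip())
    return found


root = ET.fromstring(SAMPLE_XML)
bottom = texts_in_bbox(root, "23", 0, 0, 600, 200)
print(bottom)  # ['Your balance is $12.34', 'Page 23']
```

The line at y0=700 sits near the top of the page, so it falls outside the 0 to 200 band and is dropped.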
Jack Cushman Apr 16 2018-12-12T00: 00Z