Convert PDF to text/HTML in Python so I can parse it

I have the following code example, where I download a PDF file from the European Parliament website for a given legislative proposal:

EDIT: I ended up just getting the link and submitting it to Adobe's online conversion tool (see code below):

```python
import mechanize
import urllib2
import re
from BeautifulSoup import BeautifulSoup

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"

def get_pdf(soup2):
    # Collect the links to REPORT documents
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if not new_link:  # findAll returns a list, so test for emptiness, not None
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            for a in bs.findAll("a"):
                if re.search("PDF", str(a)) is not None:
                    pdf_link = "http://www.europarl.europa.eu/" + a["href"]
            # Save the PDF locally ("wb": PDFs are binary)
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y, p)
            localfile = open(name_pdf, "wb")
            localfile.write(pdf.read())
            localfile.close()
            # Submit the link to Adobe's online conversion form
            br.open(adobe)
            br.select_form(name="convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()
            soup = BeautifulSoup(pdf_html)

page = range(1, 2)        # can be set to 400 to get every document for a given year
year = range(1999, 2000)  # can be set to 2011 to get documents from all years
for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name="byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        if soup1.find(text="No search result") is not None:
            print "%s %s No page, skipping..." % (y, p)
        else:
            print "%s %s Writing dossier..." % (y, p)
            for i in br.links(url_regex="file.jsp"):
                link = i  # keep the last matching link
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)
```

In the get_pdf() function, I would like to convert the PDF file to text in Python so that I can parse the text for information about the legislative procedure. Can someone explain to me how to do this?

Thomas

+4
python text parsing pdf
Sep 03 '10 at 16:41
3 answers

It's not exactly magic. I suggest:

  • downloading the PDF file to a temp directory,
  • calling an external program to extract the text to a (temp) text file,
  • reading the text file.

For the command-line utility to extract the text, you have a number of options, and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text), then put them together. For the call, use subprocess.Popen or subprocess.call().
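The steps above can be sketched roughly as follows. This is only an illustration, assuming Poppler/xpdf's pdftotext utility is on your PATH; any of the other command-line extractors would slot in the same way, with their own flags:

```python
import os
import subprocess
import tempfile

def pdftotext_command(pdf_path, txt_path):
    # Command line for the pdftotext utility; -layout asks it to
    # preserve the physical layout of the page in the output.
    return ["pdftotext", "-layout", pdf_path, txt_path]

def extract_text(pdf_path):
    # Extract to a temp text file, read it back, then clean up.
    fd, txt_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        subprocess.call(pdftotext_command(pdf_path, txt_path))
        with open(txt_path) as f:
            return f.read()
    finally:
        os.remove(txt_path)
```

From there, extract_text(name_pdf) gives you a plain string to parse with re or whatever you like.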

+2
Sep 03 '10 at 18:29

It sounds like you've found a solution, but if you ever want to do it without a web service, or if you need to scrape the data based on its exact location on the PDF page, may I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.

To use it, once you've saved the file to disk, you would do pdf = pdfquery.PDFQuery(name_pdf), or pass in the urllib file object directly if you don't need to save it. To get XML out to parse with BeautifulSoup, you can do pdf.tree.tostring().

If you don't mind using jQuery-style selectors, there's also a PyQuery interface with positional extensions, which can be really handy. For example:

```python
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [
    el.text
    for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')
]
```
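Purely for illustration, here is the same kind of attribute-based lookup done with the standard library's ElementTree on a hypothetical XML fragment shaped like pdfquery's LTPage output (the element and attribute names here are assumptions for the example, not pdfquery's guaranteed schema):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for pdf.tree.tostring() output.
xml = """
<pdfxml>
  <LTPage page_label="23">
    <LTTextLineHorizontal x0="50" y0="100">Your balance is 42.00</LTTextLineHorizontal>
  </LTPage>
</pdfxml>
"""

root = ET.fromstring(xml)
# Find the page by its label, then pull the text of each line on it.
page = root.find(".//LTPage[@page_label='23']")
lines = [el.text for el in page.iter("LTTextLineHorizontal")]
print(lines[0])  # -> Your balance is 42.00
```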
+8
Apr 16

Have you checked out PDFMiner?

+3
Sep 03 '10 at 16:46


