I am trying to automatically download a PDF file from a site (http://bibliotecadigitalhispanica.bne.es) using Python.
I tried the urllib, urllib2, and mechanize modules (which have worked for me on other sites, using standard functions like urlopen, urlretrieve, etc.), but here the links have JavaScript embedded in their href attributes that does some processing and then opens the PDF, which apparently these modules cannot execute, at least from what I have read. For example, when I do the following:
    import mechanize

    request = mechanize.Request('the example url below')
    response = mechanize.urlopen(request)
it just returns the HTML page containing the link; I can't extract the PDF from it (there are no direct links to the PDF inside that page).
By watching the headers in a real browser (using the LiveHTTPHeaders extension in Firefox), I know that a lot of HTTP requests are made, and the PDF is returned at the end (and displayed in the browser). I would like to intercept that chain and download the file. Specifically, I see a series of 302 and 304 responses that ultimately lead to the PDF.
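For reference, this is roughly how I tried to follow that chain by hand with urllib2: it follows 302 redirects on its own, I added cookie handling in case the session matters, and I check the Content-Type of the final response. The URL is a placeholder for the one extracted from the href below, so treat this as a sketch of the approach, not something that works on this site:

    import urllib2
    import cookielib

    # Keep cookies across the redirect chain in case the session matters.
    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

    # Placeholder: the URL pulled out of the javascript: href below.
    url = 'http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess'

    response = opener.open(url)  # urllib2 follows the 302s automatically

    # Check whether the final response is actually the PDF.
    if response.info().gettype() == 'application/pdf':
        with open('document.pdf', 'wb') as f:
            f.write(response.read())
    else:
        print 'Got %s instead of a PDF' % response.info().gettype()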
Here is an example of the link attribute that I am scanning:

    href='javascript:open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess");'
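Pulling the inner URL out of that javascript: wrapper is simple enough; this is what I do (a small sketch with the href hard-coded):

    import re

    href = ('javascript:open_window_delivery('
            '"http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess");')

    # Grab the quoted URL passed to open_window_delivery().
    match = re.search(r'open_window_delivery\("([^"]+)"\)', href)
    if match:
        print match.group(1)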
It seems that if I could run the JavaScript embedded in the href attribute, I would eventually reach the PDF document itself. I tried Selenium, but it's a bit confusing; even after reading its documentation I'm not quite sure how to use it. Can someone suggest a way to do this (either with a module I haven't tried, or with one I have)?
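For completeness, this is the kind of Selenium attempt I was fumbling with: a Firefox profile set up to save PDFs to disk instead of displaying them, then executing the page's own open_window_delivery() call. The preference names are my best reading of the docs, so this is a sketch rather than something I know works here:

    from selenium import webdriver

    # Profile that saves PDFs to disk instead of opening them in the browser.
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)  # custom dir
    profile.set_preference('browser.download.dir', '/tmp/pdfs')
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                           'application/pdf')
    profile.set_preference('pdfjs.disabled', True)  # skip the built-in viewer

    driver = webdriver.Firefox(firefox_profile=profile)
    driver.get('http://bibliotecadigitalhispanica.bne.es')  # the results page

    # Run the same JavaScript the href would run; the browser then follows
    # the 302/304 chain itself and the PDF should land in the download dir.
    driver.execute_script(
        'open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80'
        '/verylonglinktoaccess");')

    driver.quit()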
Thanks so much for any help with this.
PS: If you want to see what I am trying to replicate, here is the page with the PDF links I mentioned above (the ones with the PDF icons): http://bibliotecadigitalhispanica.bne.es/R/9424CFL1MDQGLGBB98QSV1HFAD2APYDME4GQK0BSLBFL2FQ4LQFLF4FQFL2?func=collections-result&collection_id=1356