JavaScript execution in href links with Python

I am trying to download a PDF file automatically from a site (http://bibliotecadigitalhispanica.bne.es) using Python.

I tried the urllib / urllib2 / mechanize modules (which I have used for other sites, with the standard functions like urlopen, urlretrieve, etc.), but here the links have JavaScript built into their href attributes that does some processing and opens the PDF file, which these modules apparently cannot handle, at least from what I have read. For example, when I do the following:

request = mechanize.Request('the example url below')
response = mechanize.urlopen(request)

it just returns the containing HTML page, and I can't extract the PDF from it (there are no direct links to it inside that page).

I know from watching the headers in a real browser (using the Live HTTP Headers extension in Firefox) that a whole series of HTTP requests is made behind the scenes, at the end of which the PDF is returned (and displayed in the browser). I would like to intercept this and download the file. Specifically, I see a series of 302 and 304 responses that ultimately lead to the PDF.

Here is an example of the link attribute that I am scraping: href='javascript:open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess");'
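As an aside, the target URL can be pulled straight out of such an href with a regular expression (a minimal sketch; the URL below is the placeholder from the example above, not a real link):

```python
import re

href = ('javascript:open_window_delivery('
        '"http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess");')

# Extract the first quoted argument of open_window_delivery().
match = re.search(r'open_window_delivery\("([^"]+)"\)', href)
url = match.group(1) if match else None
print(url)  # http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess
```

Whether fetching that URL directly gives you the PDF depends on how much session state the server expects, but it is the cheapest thing to try first.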

It seems that if I could execute the JavaScript embedded in the href attribute, I would eventually reach the PDF document itself. I tried Selenium, but it's a bit confusing: even after reading its documentation I'm not quite sure how to use it. Can someone suggest a way to do this (either with a module I haven't tried, or with one of the ones I have)?

Thanks so much for any help with this.

PS: If you want to see what I am trying to replicate, these are the PDF links (the ones with PDF icons) on this page: http://bibliotecadigitalhispanica.bne.es/R/9424CFL1MDQGLGBB98QSV1HFAD2APYDME4GQK0BSLBFL2FQ4LQFLF4FQFL2?func=collections-result&collection_id=1356

1 answer

javascript:open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80/webclient/DeliveryManager?application=DIGITOOL-3&owner=resourcediscovery&custom_att_2=simple_viewer&forebear_coll&13p_dle_jle&jp_handle_gle&jp_handle_gleest_handle_27_&lep_handle_27_&lep_handle_27_&lep_handle_handle_handle_handle_handle_handle_jest http://bibliotecadigitalhispanica.bne.es:80/R/7IUR42HNR5J19AY1Y3QJTL1P9M2AN81RCY4DRFE8JN5T22BI7I-03416");

This URL leads to a 302 redirect. If you follow it, you land on a frameset page whose bottom frame is the content page:

http://bibliotecadigitalhispanica.bne.es///exlibris/dtl/d3_1/apache_media/L2V4bGlicmlzL2R0bC9kM18xL2FwYWNoZV9tZWRpYS8xNjczNDE2.pdf

curl (and libcurl) can follow 302 redirects (pass -L on the command line).
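The same redirect-following is available in the Python standard library: the default urllib2 opener (urllib.request in Python 3) installs an HTTPRedirectHandler, so 302s are followed transparently. A minimal sketch, assuming the server accepts a browser-like User-Agent (some servers refuse Python's default one):

```python
import urllib.request  # this module was urllib2 in Python 2

def fetch(url):
    # The default opener follows 301/302 redirects automatically.
    # The browser-like User-Agent is an assumption, not required by
    # every server.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

From the shell, `curl -L <url> -o out.pdf` does the same thing.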

So far, JavaScript is not the problem. You then end up in single_viewer_toolbar2.jsp, where the setLabelMetadataStream function assembles the URL of the PDF file before sending it to its iframe, "sendRequestIFrame".

I see 3 possibilities:

  • JavaScript-execution approach: high complexity, a lot of code to write, probably fragile.
  • Something browser-based: Selenium is probably a good fit. I also know that elinks2 supports JavaScript, and according to its Wikipedia page it can be scripted in "Perl, Ruby, Lua and GNU Guile".
  • Contact the site's webmaster for help. You should do this anyway to learn their policy/attitude toward bots. Perhaps they can provide you (and others) with an interface/API.

I recommend looking further into Selenium; it seems the easiest.
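A minimal sketch of that approach. The helper below is hypothetical (not part of Selenium itself); it strips the javascript: scheme and hands the rest to the browser via execute_script, which any Selenium webdriver provides:

```python
import re

def run_href_javascript(driver, href):
    """Execute the script embedded in a javascript: href in the browser.

    `driver` can be any object with an execute_script() method,
    e.g. a selenium webdriver instance.
    """
    # Strip the leading 'javascript:' scheme; the rest is plain JS.
    script = re.sub(r'^\s*javascript:\s*', '', href, flags=re.IGNORECASE)
    return driver.execute_script(script)

# Hypothetical usage (assumes Selenium and a Firefox driver are installed;
# the page URL and selector are placeholders):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get('http://bibliotecadigitalhispanica.bne.es/...')
# link = driver.find_element('css selector', 'a[href^="javascript:"]')
# run_href_javascript(driver, link.get_attribute('href'))
```

Once the script runs, the PDF viewer frame loads in the real browser, and you can read the final URL or page source back from the driver.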


Source: https://habr.com/ru/post/910891/
