JavaScript execution in href links with Python

I am trying to download a PDF file automatically from a site (http://bibliotecadigitalhispanica.bne.es) using Python.

I tried the urllib / urllib2 / mechanize modules (which I have used for other sites, with the standard functions like urlopen, urlretrieve, etc.), but here the links have JavaScript built into their href attributes that does some processing and opens the PDF file, which these modules apparently cannot handle, at least from what I have read. For example, when I do the following:

request = mechanize.Request('the example url below')
response = mechanize.urlopen(request)

it just returns the containing HTML page, and I can't extract the PDF from it (there are no direct links to it inside that page).

I know from watching the headers in a real browser (using the Live HTTP Headers extension in Firefox) that a whole series of HTTP requests is made behind the scenes, at the end of which the PDF is returned (and displayed in the browser). I would like to intercept this and download the file. Specifically, I see a series of 302 and 304 responses that ultimately lead to the PDF.

Here is an example of the link attribute that I am scraping: href='javascript:open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess");'
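As an aside, the target URL can be pulled straight out of such an href with a regular expression (a minimal sketch; the URL below is the placeholder from the example above, not a real link):

```python
import re

href = ('javascript:open_window_delivery('
        '"http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess");')

# Extract the first quoted argument of open_window_delivery().
match = re.search(r'open_window_delivery\("([^"]+)"\)', href)
url = match.group(1) if match else None
print(url)  # http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess
```

Whether fetching that URL directly gives you the PDF depends on how much session state the server expects, but it is the cheapest thing to try first.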

It seems that if I could execute the JavaScript embedded in the href attribute, I would eventually reach the PDF document itself. I tried Selenium, but it's a bit confusing: even after reading its documentation I'm not quite sure how to use it. Can someone suggest a way to do this (either with a module I haven't tried, or with one of the ones I have)?

Thanks so much for any help with this.

PS: If you want to see what I am trying to replicate, these are the PDF links (the ones with PDF icons) on this page: http://bibliotecadigitalhispanica.bne.es/R/9424CFL1MDQGLGBB98QSV1HFAD2APYDME4GQK0BSLBFL2FQ4LQFLF4FQFL2?func=collections-result&collection_id=1356

1 answer

javascript:open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80/webclient/DeliveryManager?application=DIGITOOL-3&owner=resourcediscovery&custom_att_2=simple_viewer&forebear_coll&13p_dle_jle&jp_handle_gle&jp_handle_gleest_handle_27_&lep_handle_27_&lep_handle_27_&lep_handle_handle_handle_handle_handle_handle_jest http://bibliotecadigitalhispanica.bne.es:80/R/7IUR42HNR5J19AY1Y3QJTL1P9M2AN81RCY4DRFE8JN5T22BI7I-03416");

This URL leads to a 302 redirect. If you follow it, you land on a frameset page whose bottom frame is the content page:

http://bibliotecadigitalhispanica.bne.es///exlibris/dtl/d3_1/apache_media/L2V4bGlicmlzL2R0bC9kM18xL2FwYWNoZV9tZWRpYS8xNjczNDE2.pdf

curl (and libcurl) can follow 302 redirects (pass -L on the command line).
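The same redirect-following is available in the Python standard library: the default urllib2 opener (urllib.request in Python 3) installs an HTTPRedirectHandler, so 302s are followed transparently. A minimal sketch, assuming the server accepts a browser-like User-Agent (some servers refuse Python's default one):

```python
import urllib.request  # this module was urllib2 in Python 2

def fetch(url):
    # The default opener follows 301/302 redirects automatically.
    # The browser-like User-Agent is an assumption, not required by
    # every server.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

From the shell, `curl -L <url> -o out.pdf` does the same thing.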

So far, JavaScript is not the problem. You then end up in single_viewer_toolbar2.jsp, where the setLabelMetadataStream function assembles the URL of the PDF file before sending it to its iframe, "sendRequestIFrame".

I see 3 possibilities:

  • JavaScript-execution approach: high complexity, a lot of code to write, probably fragile.
  • Something browser-based: Selenium is probably a good fit. I also know that elinks2 supports JavaScript, and according to its Wikipedia page it can be scripted in "Perl, Ruby, Lua and GNU Guile".
  • Contact the site's webmaster for help. You should do this anyway to learn their policy/attitude toward bots. Perhaps they can provide you (and others) with an interface/API.

I recommend looking further into Selenium; it seems the easiest.
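A minimal sketch of that approach. The helper below is hypothetical (not part of Selenium itself); it strips the javascript: scheme and hands the rest to the browser via execute_script, which any Selenium webdriver provides:

```python
import re

def run_href_javascript(driver, href):
    """Execute the script embedded in a javascript: href in the browser.

    `driver` can be any object with an execute_script() method,
    e.g. a selenium webdriver instance.
    """
    # Strip the leading 'javascript:' scheme; the rest is plain JS.
    script = re.sub(r'^\s*javascript:\s*', '', href, flags=re.IGNORECASE)
    return driver.execute_script(script)

# Hypothetical usage (assumes Selenium and a Firefox driver are installed;
# the page URL and selector are placeholders):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get('http://bibliotecadigitalhispanica.bne.es/...')
# link = driver.find_element('css selector', 'a[href^="javascript:"]')
# run_href_javascript(driver, link.get_attribute('href'))
```

Once the script runs, the PDF viewer frame loads in the real browser, and you can read the final URL or page source back from the driver.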


Source: https://habr.com/ru/post/910891/
