How can I retrieve a list of the URLs requested while rendering an HTML page in Python?

I want to get a list of all the URLs that the browser issues GET requests for when it opens a page. For example, when opening cnn.com, the first HTTP response references several more URLs that the browser then requests recursively.

I am not trying to display the page; I just want a list of all the URLs that are requested while rendering it. Simply scanning the contents of the first HTTP response is not sufficient, since, for example, a downloaded CSS file can itself pull in images. Can I do this in Python?

2 answers

You may have to render the page (without actually displaying it) to make sure you get a complete list of all resources. I have used PyQt and QtWebKit in similar situations. Especially once you start counting resources that are included dynamically via JavaScript, trying to parse pages and recursively fetch them with BeautifulSoup simply won't work.
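For example, with plain PyQt4 and QtWebKit you can subclass QNetworkAccessManager so that it records every request the page triggers while loading. This is a minimal sketch of that idea (the target URL is a placeholder), not how Ghost.py does it internally:

    import sys
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtNetwork import QNetworkAccessManager
    from PyQt4.QtWebKit import QWebPage

    class LoggingManager(QNetworkAccessManager):
        """Records the URL of every request the page makes."""
        def __init__(self):
            QNetworkAccessManager.__init__(self)
            self.requested_urls = []

        def createRequest(self, operation, request, data=None):
            self.requested_urls.append(request.url().toString())
            return QNetworkAccessManager.createRequest(
                self, operation, request, data)

    app = QApplication(sys.argv)
    page = QWebPage()
    manager = LoggingManager()
    page.setNetworkAccessManager(manager)  # must be set before loading
    page.loadFinished.connect(app.quit)    # stop the event loop when rendering is done
    page.mainFrame().load(QUrl('http://cnn.com'))
    app.exec_()

    for url in manager.requested_urls:
        print(url)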

Ghost.py is a great client to get you started with PyQt. Also check out the QWebView docs and the QNetworkAccessManager docs.

Ghost.py returns a tuple (page, resources) when the page is opened:

    from ghost import Ghost

    ghost = Ghost()
    page, resources = ghost.open('http://my.web.page')

resources contains all the resources loaded for the requested URL, as HttpResource objects. You can get the URL of each loaded resource from resource.url.
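Building on the snippet above, printing every URL requested while rendering a page is then just a loop over resources (the target URL is a placeholder):

    from ghost import Ghost

    ghost = Ghost()
    page, resources = ghost.open('http://cnn.com')

    # Each entry is an HttpResource; its url attribute holds the requested URL.
    for resource in resources:
        print(resource.url)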


I think you will need to build a list of all the known file extensions that you do NOT want, and then scan the contents of the HTTP response, keeping only the links that pass a check like "if substring not in blocklist".

The problem is that hrefs end in all sorts of things: TLDs, trailing slashes, URL query parameters, and so on, so I think it is easier to check against the things you know you don't want.
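A minimal sketch of that filter-by-exclusion idea; the regex, the blocklist contents, and the target URL are all illustrative assumptions:

    import re
    import urllib2  # Python 2; on Python 3 use urllib.request instead

    # Extensions we do NOT want (illustrative; extend as needed).
    blocklist = ['.png', '.jpg', '.jpeg', '.gif', '.ico']

    html = urllib2.urlopen('http://cnn.com').read()

    # Naive href/src extraction; a real parser such as BeautifulSoup is more robust.
    candidates = re.findall(r'(?:href|src)=["\']([^"\']+)["\']', html)

    # Crude substring check, as suggested above: drop any link that
    # contains an unwanted extension anywhere in it.
    wanted = [url for url in candidates
              if not any(ext in url.lower() for ext in blocklist)]

    for url in wanted:
        print(url)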



