Scraping a website with dynamic JavaScript content

I use Python and beautifulsoup4 (which I am not attached to) to scrape a website. The problem is that when I fetch the HTML pages with urllib, I do not get the whole page, because part of it is generated by JavaScript. Is there any way around this?

1 answer

There are basically two main options:

  • Using the browser's developer tools, see which AJAX requests the page makes to load its content and simulate them in your script. You will probably need the json module to load the JSON response string into a Python data structure.
  • Use a tool like Selenium that opens a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS.
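The first option could be sketched roughly as follows. The endpoint URL, the request headers, and the shape of the JSON (an "items" list whose entries have a "title") are assumptions for illustration only; the real values have to be read from the Network tab of the browser's developer tools.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the request URL seen in the Network tab.
API_URL = "https://example.com/api/items?page=1"


def fetch_json(url, timeout=10):
    """Request the URL the page's JavaScript calls and decode the JSON body."""
    request = urllib.request.Request(
        url,
        headers={
            # Some sites only answer requests that look like browser AJAX calls.
            "User-Agent": "Mozilla/5.0",
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def extract_titles(payload):
    """Pull the wanted fields out of the decoded data (assumed structure)."""
    return [item["title"] for item in payload.get("items", [])]


# Usage (performs a network request):
# print(extract_titles(fetch_json(API_URL)))
```

Because the data arrives as JSON, BeautifulSoup is not needed for this part at all; the fragile piece is the endpoint itself, which the site can change without notice.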

The first option is more difficult to implement and is generally more fragile, but it does not require a real browser and can be faster.
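For the second option, here is a minimal sketch assuming Selenium 4 and a locally installed Chrome. Headless Chrome is used here in place of PhantomJS, which Selenium 4 no longer supports. The function name, the CSS selector to wait for, and the timeout are hypothetical and need adapting to the actual page.

```python
def render_page(url, wait_selector, timeout=10):
    """Load a page in headless Chrome and return the HTML after JavaScript ran."""
    # Imported here so the module can be loaded where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until an element produced by the page's JavaScript is present.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out


# Usage (starts a real browser; URL and selector are placeholders):
# html = render_page("https://example.com/listing", "div.item")
```

The returned HTML can then be passed to BeautifulSoup exactly as with a page fetched through urllib.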




Source: https://habr.com/ru/post/1533920/

