What is the best way to develop web crawlers?

I often create crawlers to compile information, and when I come across a site I need data from, I write a new crawler specific to that site, using shell scripts most of the time and sometimes PHP.

The way I do it is simple: iterate through the list of pages, download each one with wget, and use sed, tr, awk and other tools to clean the page and capture the specific information I need.
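
Roughly, each run looks like the sketch below (written in Python here for readability rather than my usual shell; the URL list and the pattern to capture are just placeholders):

    # Rough Python equivalent of the wget + sed/tr/awk pipeline described above.
    # The URLs and the price-like pattern are made-up placeholders.
    import re
    import urllib.request

    urls = [
        "https://example.com/catalog?page=1",
        "https://example.com/catalog?page=2",
    ]

    pattern = re.compile(r"\$\d+\.\d{2}")  # the specific information to capture

    for url in urls:
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # Crude tag stripping in the spirit of sed/tr; not a real HTML parser.
        text = re.sub(r"<[^>]+>", " ", html)
        for match in pattern.findall(text):
            print(url, match)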

The whole process takes some time depending on the site, most of it spent loading all the pages. And I often run into AJAX-heavy sites, which complicates everything.

I was wondering if there are better or faster ways to do this, or even applications or languages that could help with this kind of work.

+3
2 answers

Using regular expressions to parse HTML content is a bad idea, and that has been covered in countless cases.

You should parse the document into a DOM tree, and then you can pull out any hyperlinks, stylesheets, script files, images or other external links you want and traverse them accordingly.

Fetch the pages with something like curl (in PHP) and process the HTML with a proper parser (e.g. Beautiful Soup in Python).
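
A minimal sketch of that fetch-then-parse approach in Python with Beautiful Soup (the start URL is a placeholder; install the parser with pip install beautifulsoup4):

    # Fetch a page and walk the parsed DOM instead of using regexes.
    # The start URL is a placeholder; requires: pip install beautifulsoup4
    import urllib.request
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    start_url = "https://example.com/"

    with urllib.request.urlopen(start_url) as resp:
        html = resp.read()

    soup = BeautifulSoup(html, "html.parser")

    # Pull hyperlinks and images out of the DOM tree; other resources
    # (stylesheets, scripts) can be collected the same way.
    links = [urljoin(start_url, a["href"]) for a in soup.find_all("a", href=True)]
    images = [img["src"] for img in soup.find_all("img", src=True)]

    print(len(links), "links to crawl next")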

+6

If you use Python, take a look at Scrapy.
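
A minimal Scrapy spider looks something like this (the start URL and the CSS selectors are placeholders):

    # Minimal Scrapy spider; the start URL and selectors are placeholders.
    # Run with: scrapy runspider example_spider.py -o items.json
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # Extract items from the current page.
            for title in response.css("h2::text").getall():
                yield {"title": title}
            # Follow links and parse them the same way; Scrapy handles
            # scheduling, deduplication and concurrent downloads for you.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)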

+2

Source: https://habr.com/ru/post/1702854/

