I often write small scrapers to compile information: when I need data from a website, I launch a new crawler specific to that site, usually written as a shell script and sometimes in PHP.
The way I do it is simple: I iterate through the list of pages, download each one with wget, then use sed, tr, awk, and other tools to clean up the page and capture the specific information I need.
The whole process takes a while, depending on the site, and loading all the pages takes even longer. I also often run into AJAX-driven sites, which complicates everything.
I was wondering if there are better, faster ways to do this, or even applications or languages that would help with this kind of work.