How to handle web scraper to change urls

Question


Recently I have been working on web scrapers. After some research and analysis I got the hang of it, but I am stuck at a point for which I can't find a suitable answer, even after searching the Internet. My scraper logs in to an intranet page with a user name and password. With the URL that is in my code I can get the data, but when the URL changed, my code could no longer log in because it was going to the wrong URL. The code that visits the link is a kind of agent that opens the URL when it receives an update command.

I would like to know of any good tool or book that can help me understand how to apply artificial intelligence to a web scraper, so that my agents can adapt dynamically without my having to reconfigure them manually. Any help would be much appreciated.

+4

artificial-intelligence web-crawler web-scraping jsoup

chaosguru Jun 19 '13 at 6:56


1 answer

Ovidiu Alexa · Answer 1 · 2014-04-17T14:09:43+0000

If the links change frequently, you can read the HTTP headers returned by the old link and check whether they contain a redirect to the new link.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3

These are the HTTP redirect (3xx) status codes.

I do not know what software you use for scraping, but I am sure that it can handle redirects.

For example, with cURL in PHP, the following option makes the request follow redirects:

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //FROM http://stackoverflow.com/questions/3519939/make-curl-follow-redirects
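If automatic following is not enough and the scraper needs to remember the new address for later logins, the redirect headers can also be read by hand. Below is a minimal sketch in Python, not tied to any particular scraping library. The names `redirect_target`, `resolve_final_url` and the `fetch` callback are assumptions made for illustration; `fetch(url)` stands for whatever HTTP client you use, with automatic redirects switched off, returning the status code and the response headers.

```python
from urllib.parse import urljoin

# Status codes that announce a new address in the Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(current_url, status, headers):
    """Return the absolute URL a response redirects to, or None if it does not redirect."""
    if status not in REDIRECT_CODES:
        return None
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location")
    if not location:
        return None
    # Location may be relative; resolve it against the URL that was requested.
    return urljoin(current_url, location)


def resolve_final_url(start_url, fetch, max_hops=10):
    """Follow redirects hop by hop. `fetch(url)` is a hypothetical callback
    that must return a (status, headers) pair for the given URL."""
    url = start_url
    for _ in range(max_hops):
        status, headers = fetch(url)
        target = redirect_target(url, status, headers)
        if target is None:
            return url  # no further redirect: this is the current address
        url = target
    raise RuntimeError("too many redirects")
```

The agent could store the URL returned by `resolve_final_url` and use it for the next login instead of the configured one. This only helps when the old address still answers with a redirect; if the old page simply disappears, there is nothing to follow.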

To answer your request

I would like to know any good tool or some book that can help me understand about the use of artificial intelligence in a web scraper

PHP is a good tool for learning basic web scraping, but it is not as fast as you might imagine. The fastest technology I know of for this is Erlang, but it is not very beginner-friendly.
