Web Crawler PHP / Javascript Link Parsing?

Question

I am currently using the HTML Agility Pack in C# for a web crawler. I have managed to avoid many problems so far (URIs that are not valid on their own, like "/extra/url/to/base.html" and "#"), but I also need to handle PHP, JavaScript, etc. On some sites the links point to PHP pages, and when my web crawler tries to follow them, it fails. One example is a page of PHP / JavaScript accordion links. How do I navigate / parse these links?

+4

c# web-crawler

cam Feb 19 '10 at 13:13

1 answer

hannson · Accepted Answer · 2010-02-23T19:23:03+0000

Let's see if I understood your question correctly. I know this answer is probably inadequate, but if you need a more specific answer, I will need more information.

Are you trying to program a web crawler but cannot crawl URLs that end in .php?

If this is the case, you need to take a step back and think about why this is so. It may be because the crawler decides which URLs to crawl using a regular expression based on the form of the URL (for example, its file extension).

In most cases, these URLs return plain HTML, but they can also return a generated image (for example, a captcha) or a download such as a 700 MB ISO file, and there is no way to find out without checking the HTTP response headers for that URL.

Note: if you are writing your own crawler from scratch, you will need a good understanding of HTTP.

The first thing your crawler sees when it requests a URL is the response header, which contains the MIME content type. It tells the browser / crawler how to process and open the data (whether it is HTML, plain text, an .exe, etc.). You probably want to decide which pages to load based on the MIME type instead of the form of the URL. The MIME type for HTML is text/html, and you should check for it using your HTTP library before downloading the rest of the content at that URL.
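To illustrate the Content-Type check described above, here is a minimal sketch in Python using only the standard library (the same idea applies to whatever HTTP library you use in C#). The function names `is_html` and `fetch_if_html` and the User-Agent string are made up for this example, and falling back to UTF-8 when the server sends no charset is an assumption.

```python
from urllib.request import Request, urlopen


def is_html(content_type):
    """Return True if a Content-Type header value denotes text/html."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def fetch_if_html(url, timeout=10):
    """Return the decoded page body if the server reports HTML, else None."""
    # Hypothetical User-Agent, used here only for illustration.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # The headers are available here; the body has not been read yet.
        if not is_html(response.headers.get("Content-Type")):
            return None
        # Assumption: treat the body as UTF-8 if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With this approach a URL ending in .php is treated like any other: `fetch_if_html` returns the page when the server says it is text/html and returns None for images, ISO downloads and so on, without reading their bodies.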

Javascript issue

The same applies here, except that executing JavaScript in a crawler / parser is quite unusual for simple projects and can create more problems than it solves. Why do you need JavaScript?
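If the crawler does not execute scripts, one option is to skip "javascript:" pseudo-links and bare "#" anchors and to resolve every other href against the URL of the page it was found on. The sketch below shows that idea in Python with the standard library; it is not how the HTML Agility Pack works. `LinkCollector` and `extract_links` are hypothetical names, and keeping only http / https links is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the <a href> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs and same-page anchors such as "#".
        if not href or href.strip().startswith("#"):
            return
        # Resolve relative references against the page URL, drop the fragment.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href.strip()))
        # Assumption: only http(s) is crawlable; this drops javascript:, mailto:, etc.
        if urlsplit(absolute).scheme not in ("http", "https"):
            return
        if absolute not in self.links:
            self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated absolute links found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Called with the URL of the page that was fetched, `extract_links` turns a relative link like "/extra/url/to/base.html" into an absolute URL the crawler can request, and links that only do something through JavaScript are left out instead of causing a failed request.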

Another solution
If you want to learn Python (or already know it), I suggest you look at Scrapy. It is a web crawling framework, similar in style to the Django web framework. It is really easy to use, and many problems have already been solved for you, so it can be a good starting point if you are trying to learn more about this technology.
