I'm not sure I've understood your question correctly, so this answer may be incomplete; if you need something more specific, please add more details.
Are you trying to write a web crawler that fails on URLs ending in .php?
If so, step back and think about why that happens. It is likely because the crawler decides which URLs to fetch using a regular expression over the URL itself (for example, its file extension) rather than looking at what the server actually returns.
In most cases such URLs return plain HTML, but they can also return a generated image (for example, a CAPTCHA) or a download of a 700 MB ISO file, and there is no way to tell the difference without inspecting the HTTP response headers for that URL.
Note: if you are writing your own crawler from scratch, you will need a good understanding of HTTP.
The first thing your crawler sees when it requests a URL is the response header, which carries the MIME content type. This tells the browser or crawler how to handle the data (HTML, plain text, an executable, and so on). You probably want to decide whether to download a page based on its MIME type instead of the URL. The MIME type for HTML is text/html, and you should check it with your HTTP library (for example, via a HEAD request) before downloading the rest of the content.
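As a minimal sketch of this idea using only Python's standard library: the helper names `should_crawl` and `head_content_type` are illustrative, not part of any particular crawler framework. The key point is that the Content-Type header may carry parameters (such as `charset`), so the MIME type must be isolated before comparing.

```python
from urllib.request import Request, urlopen

# MIME types we are willing to parse as pages (an assumption for this sketch)
CRAWLABLE_TYPES = {"text/html", "application/xhtml+xml"}

def should_crawl(content_type_header):
    """Decide from a Content-Type header value whether to fetch the body.

    The header may include parameters, e.g. "text/html; charset=utf-8",
    so strip everything after the first ';' before comparing.
    """
    mime = content_type_header.split(";")[0].strip().lower()
    return mime in CRAWLABLE_TYPES

def head_content_type(url):
    """Issue a HEAD request and return the Content-Type header.

    HEAD returns only the headers, so we avoid downloading a large
    binary body just to discover we don't want it.
    """
    req = Request(url, method="HEAD")
    with urlopen(req) as resp:
        return resp.headers.get("Content-Type", "")
```

With this in place, a crawler would call `should_crawl(head_content_type(url))` before fetching the full page, which filters out CAPTCHAs, ISO images, and other non-HTML responses regardless of how the URL is spelled.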
Javascript issue
The same advice applies here, with one addition: executing JavaScript inside a crawler or parser is quite unusual for simple projects and tends to create more problems than it solves. Why do you need JavaScript at all?
Another solution
If you want to learn Python (or already know it), I suggest you look at Scrapy. It is a web crawling framework, similar in spirit to the Django web framework. It is easy to use, and many common crawling problems have already been solved for you, so it can be a good starting point if you want to learn more about this technology.