Python: checking the content type of a URL

I wrote a Python crawler. The URLs it visits point to different kinds of resources: HTML pages, images, large archives, or other files. I need to identify these cases quickly so that the crawler skips large files, such as big archives, and keeps scanning. What is the best way to determine the type of a URL's content when the page starts loading? I know I can look at the end of the URL (.rar, .jpg, etc.), but I don't think that is a complete solution. Do I need to check a header or something like that? I also need some way to estimate the page size to prevent large downloads. In other words, I want to set a limit on the size of a loaded page to keep memory usage from growing quickly.

1 answer

If you send an HTTP HEAD request for a resource, you receive the resource's metadata without the resource data itself. In particular, the Content-Length and Content-Type headers will be of interest.

For example:

HEAD /stackoverflow/img/favicon.ico HTTP/1.1
host: sstatic.net

HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Length: 1150
Content-Type: image/x-icon
Last-Modified: Mon, 02 Aug 2010 06:04:04 GMT
Accept-Ranges: bytes
ETag: "2187d82832cb1:0"
X-Powered-By: ASP.NET
Date: Sun, 12 Sep 2010 13:38:36 GMT

You can do this in Python using httplib (the module is named http.client in Python 3):

>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]

This tells you that the resource is an image (an image/* MIME type) of 1150 bytes. That is enough information to decide whether you want to fetch the full resource.
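The decision described above could be sketched in Python 3 roughly as follows. This is only an illustration, not part of the original answer: the names head_headers and should_download, the 1 MB limit, and the text/html whitelist are assumptions chosen for the example.

```python
import http.client

MAX_BYTES = 1_000_000             # hypothetical size limit for the crawler
ALLOWED_PREFIXES = ("text/html",)  # hypothetical: only fetch HTML pages


def head_headers(host, path, timeout=10):
    """Send a HEAD request and return the response headers, names lowercased."""
    # Use http.client.HTTPSConnection instead for https:// URLs.
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        return {name.lower(): value for name, value in res.getheaders()}
    finally:
        conn.close()


def should_download(headers, max_bytes=MAX_BYTES, allowed=ALLOWED_PREFIXES):
    """Decide from Content-Type and Content-Length whether to fetch the body."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if not content_type.startswith(allowed):
        return False
    length = headers.get("content-length")
    if length is None:
        # Size unknown: allow it, but the caller should still cap the read.
        return True
    try:
        return int(length) <= max_bytes
    except ValueError:
        return False
```

A crawler might call should_download(head_headers("sstatic.net", "/stackoverflow/img/favicon.ico")) and skip the URL when the result is False. Servers are not required to send Content-Length, so a missing header is treated here as "unknown" rather than "small".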

In addition, the response shows that the server accepts HTTP requests for partial content (the Accept-Ranges header), which lets you retrieve the data in chunks.
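As a rough sketch of such a partial request in Python 3, a GET can carry a Range header asking for a byte window. The helper names range_header and fetch_range are made up for this example, and it assumes the server honours byte ranges; a server that ignores the header answers 200 with the body from the beginning.

```python
import http.client


def range_header(start, length):
    """Build a Range header asking for `length` bytes starting at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends.
    return {"Range": "bytes=%d-%d" % (start, start + length - 1)}


def fetch_range(host, path, start, length, timeout=10):
    """Request one byte range; return (status, data).

    Status 206 means the server sent the requested range. Status 200 means
    it ignored the Range header, so `data` is the first `length` bytes of
    the full body instead.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path, headers=range_header(start, length))
        res = conn.getresponse()
        # Never read more than `length` bytes, whatever the server sends.
        return res.status, res.read(length)
    finally:
        conn.close()
```

For instance, fetch_range("sstatic.net", "/stackoverflow/img/favicon.ico", 0, 512) would ask for the first 512 bytes only.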

You get the same header information if you issue a GET directly, but the server will then also start sending the resource data in the response body, which is what you want to avoid.
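If a GET is used anyway (for example, because a server does not report Content-Length), the read itself can be bounded. The sketch below is an assumption-laden illustration rather than something from the answer: read_capped is a hypothetical helper that works on any file-like object with a read(n) method, which includes the response object returned by http.client.

```python
import io


def read_capped(stream, max_bytes, chunk_size=8192):
    """Read `stream` in chunks; return the bytes, or None if it exceeds max_bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of body
        total += len(chunk)
        if total > max_bytes:
            return None  # too large: stop reading and discard what was read
        chunks.append(chunk)
    return b"".join(chunks)


# Stand-in for a response body; with http.client you would pass the
# object returned by conn.getresponse() instead.
demo = read_capped(io.BytesIO(b"x" * 10), max_bytes=100)
```

When read_capped returns None, the crawler can close the connection and move on, so at most max_bytes plus one chunk is ever held in memory.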



Source: https://habr.com/ru/post/1764370/

