Python web page cache that stores plain files and supports updates?

I am writing an application that needs repeated local access to large files fetched via HTTP. I want the files stored in a local directory (a partial mirror of sorts), so that subsequent steps of the application simply notice that the URLs are already mirrored locally, and so that other programs can use the files as well.

Ideally, it would also save timestamp or ETag information and be able to make a quick HTTP request with an If-Modified-Since or If-None-Match header to check for a new version, skipping the full download unless the file has actually changed. However, since these web pages rarely change, I can probably live with occasional stale copies and just find other ways to purge files from the cache when necessary.
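To make the update part concrete, here is a minimal sketch of the conditional request I have in mind, written against requests; the URL, local path, and validator values are placeholders, and real code would read them from wherever the cache keeps its metadata:

    import requests

    url = 'http://example.com/big-file.dat'          # placeholder URL
    local_path = 'cache/big-file.dat'                # placeholder local file
    etag = '"abc123"'                                # saved from an earlier response
    last_modified = 'Tue, 01 Jan 2019 00:00:00 GMT'  # saved from an earlier response

    res = requests.get(url, headers={'If-None-Match': etag,
                                     'If-Modified-Since': last_modified})
    if res.status_code == 304:
        pass  # not modified: keep using the local copy
    elif res.ok:
        # Changed (or the server ignored the validators): overwrite the mirror.
        with open(local_path, 'wb') as f:
            f.write(res.content)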

Looking around, I can see that urllib.request.urlretrieve can save copies to files of my choosing, but it looks like it cannot handle the If-Modified-Since part of my cache update goal.
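For example, something along these lines puts the file where I want it, but always re-downloads the whole body (the URL and path below are just placeholders):

    from urllib.request import urlretrieve

    # Saves the response body to a file of my choosing, but sends a plain
    # unconditional GET, so the full file is fetched even if it is unchanged.
    filename, headers = urlretrieve('http://example.com/big-file.dat',
                                    'cache/big-file.dat')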

The requests module seems like the latest and greatest, but it does not appear to handle this case on its own. There is a CacheControl module that would cover my cache update needs, since it does full HTTP caching. But it doesn't seem to store the downloaded files in a way that other (non-Python) programs can use directly, because FileCache stores resources as pickled data. And the Stack Overflow discussion "can python-requests retrieve the url directly into a file descriptor on disk, like curl?" suggests that saving to a local file can be done with some extra code, but that doesn't seem to combine well with the CacheControl module.
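For reference, this is roughly how I would use CacheControl with its FileCache backend (a sketch, assuming the cachecontrol package; the directory name and URL are placeholders):

    import requests
    from cachecontrol import CacheControl
    from cachecontrol.caches.file_cache import FileCache

    # Wrap a requests session so GETs are cached according to HTTP rules.
    sess = CacheControl(requests.Session(), cache=FileCache('.webcache'))
    res = sess.get('http://example.com/big-file.dat')
    # The body is available in res.content, but the on-disk entry is a
    # serialized cache record keyed by URL, not a plain file that other
    # programs could open by name.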

So is there a web library that does what I want? That is, one that essentially supports mirroring files fetched in the past (and tells me their filenames), without me having to manage all of this explicitly?

2 answers

I had the same requirements and found requests-cache. It was very easy to add because it extends requests. You can either keep the cache in memory, so it disappears after your script finishes, or make it persistent using sqlite, mongodb, or redis. Here are the two lines that I wrote, and they worked as advertised:

    import requests, requests_cache
    requests_cache.install_cache('scraper_cache', backend='sqlite', expire_after=3600)
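After that, ordinary requests calls are cached transparently; requests-cache also adds a from_cache attribute to responses, so you can check what happened (the URL below is only an example):

    import requests, requests_cache

    # Same two setup lines as above.
    requests_cache.install_cache('scraper_cache', backend='sqlite', expire_after=3600)

    r1 = requests.get('http://www.google.com/humans.txt')  # fetched from the network and stored
    r2 = requests.get('http://www.google.com/humans.txt')  # served from scraper_cache.sqlite
    print(r2.from_cache)  # True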

I do not think there is a library that does exactly this, but it is not that hard to implement. Here are some functions that I use with requests that may help you:

    import os
    import os.path as op
    import requests
    from urllib.parse import urlsplit

    CACHE = 'path/to/cache'

    def _file_from_url(url):
        return op.basename(urlsplit(url).path)

    def is_cached(*args, **kwargs):
        url = kwargs.get('url', args[0] if args else None)
        path = op.join(CACHE, _file_from_url(url))
        if not op.isfile(path):
            return False
        res = requests.head(*args, **kwargs)
        if not res.ok:
            return False
        # Check whether the cache is stale. Comparing content-length fitted my
        # use case; you could use the modification date or ETag here instead:
        if 'content-length' not in res.headers:
            return False
        return op.getsize(path) == int(res.headers['content-length'])

    def update_cache(*args, **kwargs):
        url = kwargs.get('url', args[0] if args else None)
        path = op.join(CACHE, _file_from_url(url))
        res = requests.get(*args, **kwargs)
        if res.ok:
            with open(path, 'wb') as handle:
                for block in res.iter_content(1024):
                    if block:
                        handle.write(block)
                        handle.flush()

Usage:

    if not is_cached('http://www.google.com/humans.txt'):
        update_cache('http://www.google.com/humans.txt')
    # Do something with cache/humans.txt
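If you do want the If-Modified-Since check from the question rather than the content-length comparison, update_cache can be adapted along these lines. This is only a sketch (the function name and signature are my own), using the cached file's mtime as the validator:

    import email.utils
    import os.path as op
    import requests
    from urllib.parse import urlsplit

    CACHE = 'path/to/cache'

    def update_cache_conditional(url):
        """Download url into CACHE unless the server says it is unmodified."""
        path = op.join(CACHE, op.basename(urlsplit(url).path))
        headers = {}
        if op.isfile(path):
            # Use the cached file's modification time as the validator.
            mtime = op.getmtime(path)
            headers['If-Modified-Since'] = email.utils.formatdate(mtime, usegmt=True)
        res = requests.get(url, headers=headers, stream=True)
        if res.status_code == 304:
            return path  # local copy is still current, nothing downloaded
        res.raise_for_status()
        with open(path, 'wb') as handle:
            for block in res.iter_content(1024):
                if block:
                    handle.write(block)
        return path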

Source: https://habr.com/ru/post/1207630/

