I am writing an application that needs repeated local access to large files downloaded via HTTP. I want the files stored in a local directory (some kind of partial mirror), so that subsequent application steps simply notice that the URLs are already mirrored locally, and so that other programs can use the files as well.
Ideally, it would also save timestamp or ETag information and be able to make a quick HTTP request with an If-Modified-Since or If-None-Match header to check for a new version, downloading the full file only when it has actually changed. However, since these web resources rarely change, I can probably live with occasional stale copies and just find other ways to remove files from the cache when necessary.
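For context, this is roughly the conditional-download behaviour I mean, sketched with requests; the URL, path, and helper name are placeholders, and the caller would have to persist the ETag/Last-Modified values between runs itself:

```python
import requests

def refresh_if_changed(url, local_path, etag=None, last_modified=None):
    """Sketch: issue a conditional GET and rewrite the local file only on change."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, stream=True)
    if resp.status_code == 304:
        # Not modified: keep the existing local copy and validators.
        return local_path, etag, last_modified
    resp.raise_for_status()
    with open(local_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            f.write(chunk)
    # Return the new validators to store for the next check.
    return local_path, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```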
Looking around, I can see that urllib.request.urlretrieve can save local copies, but it does not appear to support my If-Modified-Since cache-update requirement.
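As far as I can tell, urlretrieve only offers something like the following (placeholder URL and path), with no hook for conditional requests:

```python
import urllib.request

# Saves the body to the named local file, but always issues an unconditional GET;
# there is no obvious way to attach If-Modified-Since / If-None-Match headers here.
local_filename, headers = urllib.request.urlretrieve(
    "https://example.com/big-file.bin", "mirror/big-file.bin"
)
```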
The requests library seems like the latest and greatest, but it does not appear to handle this case either. There is a CacheControl module that supports my cache-validation requirement, since it implements HTTP caching fully. But it does not seem to save the downloaded files in a form that other (non-Python) programs can use directly, because FileCache stores resources as pickled data. And the discussion in "can python-requests retrieve the URL directly into a file descriptor on disk, like curl?" on Stack Overflow suggests that saving to a local file can be done with some extra code, but that does not seem to combine well with the CacheControl module.
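To illustrate the problem, the CacheControl usage I have in mind looks roughly like this (the cache directory name is arbitrary): revalidation happens transparently, but the on-disk cache entries are serialized response objects under hashed names rather than plain files another program could open:

```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

# The wrapped session revalidates with ETag / Last-Modified automatically,
# but FileCache writes serialized responses under hashed names in ".web_cache",
# not files that non-Python tools could read directly.
session = CacheControl(requests.Session(), cache=FileCache(".web_cache"))
resp = session.get("https://example.com/big-file.bin")
data = resp.content  # bytes are available here, but there is no usable mirror file on disk
```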
So, is there a web library that does what I want, i.e. essentially supports mirroring files that have been fetched in the past (and tells me their file names), without my having to manage this explicitly?