wget --mirror --no-parent http://www.example.com/large/
--no-parent keeps it from downloading the entire site.
Ahh, I see they have a robots.txt asking robots not to download anything from that directory:
$ curl http://www.cycustom.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /css/
Disallow: /flash/
Disallow: /large/
Disallow: /pdfs/
Disallow: /scripts/
Disallow: /small/
Disallow: /stats/
Disallow: /temp/
$
wget(1) does not document any method to ignore robots.txt, and I never found an easy way to accomplish the equivalent of --mirror with curl(1). If you want to keep using wget(1), you would need to insert an HTTP proxy in the middle that returns 404 for GET /robots.txt requests.
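For reference, a minimal sketch of such a proxy in Ruby using WEBrick's HTTPProxyServer (the port number and the use of WEBrick are my own assumptions, not part of the original answer; on Ruby 3.0+ you would need the separate webrick gem):

require 'webrick'
require 'webrick/httpproxy'

# Rewrite any robots.txt response into a 404 so wget --mirror never sees
# the Disallow rules. Everything else passes through untouched.
handler = proc do |req, res|
  if req.request_uri.path == '/robots.txt'
    res.status = 404
    res.body   = ''
  end
end

proxy = WEBrick::HTTPProxyServer.new(Port: 8888, ProxyContentHandler: handler)
trap('INT') { proxy.shutdown }
proxy.start

Then point wget at it, e.g. http_proxy=http://localhost:8888 wget --mirror --no-parent http://www.example.com/large/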
I think it's easier to change the approach. Since I also wanted more experience with Nokogiri, here is what I came up with:
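Roughly, something along these lines (a sketch only; the directory index URL, the .jpg filter, and the wget options are my assumptions, not necessarily what the original script used):

require 'nokogiri'
require 'open-uri'

# Scrape the directory index and pull down each image with wget.
# Note the base URL ends up embedded twice: once to fetch the index,
# and once per downloaded file.
doc = Nokogiri::HTML(URI.open('http://www.cycustom.com/large/'))

doc.css('a').map { |a| a['href'] }.compact.each do |href|
  next unless href =~ /\.jpg\z/i   # skip parent-directory and sort links
  system('wget', '--no-clobber', "http://www.cycustom.com/large/#{href}")
end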
It's just a quick and dirty script; embedding the URL twice is a little ugly. So if this is meant for long-term, production use, clean it up first or figure out how to use rsync(1) instead.