What is the fastest and easiest way to download all images from a website

Question

What is the fastest and easiest way to download all images from a website

What is the fastest and easiest way to download all images from a website? More specifically, http://www.cycustom.com/large/ .

I think something like wget or curl.

To clarify, at first (and above all) I don’t know how to complete this task. Secondly, I am interested to know if wget or curl has an easier to understand solution. Thanks.

--- UPDATE @sarnold ---

Thanks for answering. I thought that would help too. However, it is not. Here is the output of the command:

wget --mirror --no-parent http://www.cycustom.com/large/ --2012-01-10 18:19:36-- http://www.cycustom.com/large/ Resolving www.cycustom.com... 64.244.61.237 Connecting to www.cycustom.com|64.244.61.237|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: `www.cycustom.com/large/index.html' [ <=> ] 188,795 504K/s in 0.4s Last-modified header missing -- time-stamps turned off. 2012-01-10 18:19:37 (504 KB/s) - `www.cycustom.com/large/index.html' saved [188795] Loading robots.txt; please ignore errors. --2012-01-10 18:19:37-- http://www.cycustom.com/robots.txt Connecting to www.cycustom.com|64.244.61.237|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 174 [text/plain] Saving to: `www.cycustom.com/robots.txt' 100%[======================================================================================================================================================================================================================================>] 174 --.-K/s in 0s 2012-01-10 18:19:37 (36.6 MB/s) - `www.cycustom.com/robots.txt' saved [174/174] FINISHED --2012-01-10 18:19:37-- Downloaded: 2 files, 185K in 0.4s (505 KB/s)

Here is an image of the created files https://img.skitch.com/20120111-nputrm7hy83r7bct33midhdp6d.jpg

My goal is to have a folder with image files. The next team did not achieve this goal.

 wget --mirror --no-parent http://www.cycustom.com/large/

+6

curl wget

John erck Jan 11 '12 at 0:28

source share

2 answers

Andrea · Answer 1 · 2013-05-29T08:55:44+0000

The robots.txt file can be ignored by adding the following option:

 -e robots=off

I also recommend adding a parameter to slow down loading in order to limit server load. For example, these options wait 30 seconds between one file and the following:

 --wait 30

sarnold · Answer 2 · 2012-01-11T00:31:02+0000

 wget --mirror --no-parent http://www.example.com/large/

--no-parent does not allow it to delete the entire site.

Ahh, I see they placed robots.txt requesting robots so as not to upload photos from this directory:

 $ curl http://www.cycustom.com/robots.txt User-agent: * Disallow: /admin/ Disallow: /css/ Disallow: /flash/ Disallow: /large/ Disallow: /pdfs/ Disallow: /scripts/ Disallow: /small/ Disallow: /stats/ Disallow: /temp/ $

wget(1) does not document any method to ignore robots.txt , and I never found an easy way to accomplish the --mirror equivalent in curl(1) . If you want to continue using wget(1) , you will need to insert an HTTP proxy in the middle, which returns 404 for GET /robots.txt requests.

I think it’s easier to change the approach. Since I need more experience with Nokogiri , here is what I came up with:

 #!/usr/bin/ruby require 'open-uri' require 'nokogiri' doc = Nokogiri::HTML(open("http://www.cycustom.com/large/")) doc.css('tr > td > a').each do |link| name = link['href'] next unless name.match(/jpg/) File.open(name, "wb") do |out| out.write(open("http://www.cycustom.com/large/" + name)) end end

It's just a quick and dirty script - embedding the url twice is a little ugly. Therefore, if it is intended for long-term use of a product, first clean it or figure out how to use rsync(1) .

What is the fastest and easiest way to download all images from a website

More articles: