What is the fastest and easiest way to download all images from a website

What is the fastest and easiest way to download all images from a website? More specifically, http://www.cycustom.com/large/ .

I'm thinking of something like wget or curl.

To clarify, first (and most importantly) I don't know how to accomplish this task at all. Secondly, I am interested in whether wget or curl offers the easier-to-understand solution. Thanks.

--- UPDATE @sarnold ---

Thanks for answering. I thought that would do the trick too. However, it does not. Here is the output of the command:

 wget --mirror --no-parent http://www.cycustom.com/large/
 --2012-01-10 18:19:36--  http://www.cycustom.com/large/
 Resolving www.cycustom.com... 64.244.61.237
 Connecting to www.cycustom.com|64.244.61.237|:80... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: unspecified [text/html]
 Saving to: `www.cycustom.com/large/index.html'

     [ <=>                                 ] 188,795      504K/s   in 0.4s

 Last-modified header missing -- time-stamps turned off.
 2012-01-10 18:19:37 (504 KB/s) - `www.cycustom.com/large/index.html' saved [188795]

 Loading robots.txt; please ignore errors.
 --2012-01-10 18:19:37--  http://www.cycustom.com/robots.txt
 Connecting to www.cycustom.com|64.244.61.237|:80... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: 174 [text/plain]
 Saving to: `www.cycustom.com/robots.txt'

 100%[==========================================>] 174         --.-K/s   in 0s

 2012-01-10 18:19:37 (36.6 MB/s) - `www.cycustom.com/robots.txt' saved [174/174]

 FINISHED --2012-01-10 18:19:37--
 Downloaded: 2 files, 185K in 0.4s (505 KB/s)

Here is a screenshot of the files that were created: https://img.skitch.com/20120111-nputrm7hy83r7bct33midhdp6d.jpg

My goal is to have a folder full of image files. The above command did not accomplish that goal.

 wget --mirror --no-parent http://www.cycustom.com/large/ 
2 answers

The robots.txt file can be ignored by adding the following option:

 -e robots=off 

I would also recommend adding an option that slows the download down, to limit the load on the server. For example, this option waits 30 seconds between one file and the next:

 --wait 30 
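Putting these options together with the original wget --mirror --no-parent command, a single invocation along the following lines should leave a folder containing only the image files. The -A, -nd and -P options are my additions (filter by extension, skip wget's directory tree, choose a target folder), not part of the original answer, so treat them as assumptions and adjust as needed:

 # Sketch: mirror only /large/, ignore robots.txt, wait 30 s between requests,
 # keep only common image extensions (-A), do not recreate the remote
 # directory tree (-nd), and save everything into ./cycustom-images (-P).
 wget --mirror --no-parent -e robots=off --wait=30 \
      -A jpg,jpeg,png,gif -nd -P ./cycustom-images \
      http://www.cycustom.com/large/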
 wget --mirror --no-parent http://www.example.com/large/ 

--no-parent keeps it from crawling upward and downloading the entire site.


Ahh, I see they have put a robots.txt in place asking robots not to download the photos in this directory:

 $ curl http://www.cycustom.com/robots.txt
 User-agent: *
 Disallow: /admin/
 Disallow: /css/
 Disallow: /flash/
 Disallow: /large/
 Disallow: /pdfs/
 Disallow: /scripts/
 Disallow: /small/
 Disallow: /stats/
 Disallow: /temp/
 $

wget(1) does not document any way to ignore robots.txt, and I never found an easy way to do the equivalent of --mirror with curl(1). If you wanted to keep using wget(1), you would need to insert an HTTP proxy in the middle that returns 404 for GET /robots.txt requests.
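For completeness, here is a rough sketch of that workaround; it is not from the original answer and assumes WEBrick's HTTPProxyServer and its :ProxyContentHandler hook (webrick is a separate gem on Ruby 3 and later, and the port number is arbitrary):

 #!/usr/bin/ruby
 # Hypothetical local proxy: any /robots.txt response passing through it is
 # rewritten into an empty 404, so the client never sees the Disallow rules.
 require 'webrick'
 require 'webrick/httpproxy'

 handler = proc do |req, res|
   if req.path.end_with?('/robots.txt')
     res.status = 404
     res.body = ''
     res['Content-Length'] = '0'
   end
 end

 proxy = WEBrick::HTTPProxyServer.new(Port: 8080, ProxyContentHandler: handler)
 trap('INT') { proxy.shutdown }
 proxy.start

With that running, wget could be pointed at it through the http_proxy environment variable:

 http_proxy=http://127.0.0.1:8080 wget --mirror --no-parent http://www.cycustom.com/large/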

I think it is easier to change the approach. Since I wanted to get more experience with Nokogiri, here is what I came up with:

 #!/usr/bin/ruby
 require 'open-uri'
 require 'nokogiri'

 # Fetch and parse the directory listing page.
 doc = Nokogiri::HTML(open("http://www.cycustom.com/large/"))

 # Walk every link in the listing table and save each JPEG next to the script.
 doc.css('tr > td > a').each do |link|
   name = link['href']
   next unless name.match(/jpg/)
   File.open(name, "wb") do |out|
     # .read pulls down the image bytes; without it only the IO object's
     # string representation would be written to the file.
     out.write(open("http://www.cycustom.com/large/" + name).read)
   end
 end

It is just a quick and dirty script, and embedding the URL twice is a little ugly. So if this is meant for long-term production use, clean it up first, or figure out how to use rsync(1) instead.
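In that spirit, here is one possible cleanup, offered only as a sketch and not as the author's code: the base URL lives in a single place and each link is resolved against it, so the URL is no longer embedded twice. It uses URI.open, which needs Ruby 2.5 or newer (the 2012-era script above used the plain open form), and the JPEG-only filter is carried over from the original:

 #!/usr/bin/ruby
 # Cleaned-up sketch: define the base URL once and resolve every link
 # against it instead of pasting the URL into the code twice.
 require 'open-uri'
 require 'uri'
 require 'nokogiri'

 BASE = URI("http://www.cycustom.com/large/")

 doc = Nokogiri::HTML(URI.open(BASE))
 doc.css('tr > td > a').each do |link|
   name = link['href']
   next unless name =~ /\.jpe?g\z/i            # keep only JPEG links
   target = BASE + name                        # resolve relative to BASE
   File.open(File.basename(name), "wb") do |out|
     out.write(URI.open(target).read)
   end
 end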


Source: https://habr.com/ru/post/905673/

