404 Not Found from open-uri, but the URL works fine in a web browser

I tried many URLs and they all seemed to work fine until I came across this specific one:

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
    puts doc

This is the result:

    /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:353:in `open_http': 404 Not Found (OpenURI::HTTPError)
        from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
        from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
        from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
        from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
        from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
        from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:689:in `open'
        from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:34:in `open'
        from test.rb:5:in `<main>'

I can access this from a web browser, I just don't understand it.

What is happening and how can I deal with such an error? Can I ignore it and let the rest do their job?

+5
3 answers

You are getting 404 Not Found (OpenURI::HTTPError), so if you want your code to continue, rescue that exception. Something like this should work:

    require 'nokogiri'
    require 'open-uri'

    URLS = %w[
      http://www.moxyst.com/fashion/men-clothing/underwear.html
    ]

    URLS.each do |url|
      begin
        doc = Nokogiri::HTML(open(url))
      rescue OpenURI::HTTPError => e
        # Report the failure, then skip to the next URL
        puts "Can't access #{ url }"
        puts e.message
        puts
        next
      end
      puts doc.to_html
    end

You can rescue more general exceptions, but then you risk getting strange output, or handling an unrelated problem in a way that causes more problems, so you need to work out how fine-grained your rescue should be.
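As a rough sketch of that trade-off (the helper name `classify_failure` is made up for illustration and is not part of OpenURI), you can rescue only the classes you expect and let everything else propagate, so a plain bug such as calling a method on nil still surfaces instead of being swallowed:

```ruby
require 'open-uri'

# Hypothetical helper: runs the block and reports which kind of failure occurred.
# Anything not listed here (for example NoMethodError from a typo) is NOT
# rescued and propagates to the caller.
def classify_failure
  yield
  :ok
rescue OpenURI::HTTPError
  :http_error      # the server answered, but with an error status such as 404
rescue SocketError, SystemCallError
  :network_error   # DNS failure, refused connection, and similar
end
```

A bare `rescue => e` would have caught all of these, including the typo, which is the "unrelated problem" case.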

You can even inspect the HTTP headers, the response status, or the exception message if you want more control and want to do something different for a 401 than for a 404.
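For instance, the exception carries the response in `e.io`, and `e.io.status` is a two-element array holding the code (as a string) and the message. A minimal sketch, assuming the hypothetical names `action_for` and `fetch` and an arbitrary choice of actions. `URI.open` needs Ruby 2.5 or later; on older versions, such as the 2.0 in the question, use plain `open(url)` instead:

```ruby
require 'open-uri'

# Hypothetical helper: picks an action from the status carried by the exception.
# OpenURI::HTTPError#io exposes the response; its #status is [code, message].
def action_for(error)
  code, _message = error.io.status
  case code
  when '401' then :needs_credentials
  when '404' then :skip
  else :give_up
  end
end

# Hypothetical wrapper: returns the body, or nil after reporting the failure.
def fetch(url)
  URI.open(url, &:read)
rescue OpenURI::HTTPError => e
  warn "#{url}: #{e.message} -> #{action_for(e)}"
  nil
end
```

Calling `fetch(url)` inside your loop would then give you either the page body or `nil`, with the reason printed to standard error.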

I can access this from a web browser, I just don't understand it.

Well, maybe something is happening on the server side: perhaps they don't like the User-Agent string you are sending? The OpenURI documentation shows how to change this header:

Additional header fields can be specified using an optional hash argument.

    open("http://www.ruby-lang.org/en/",
      "User-Agent" => "Ruby/#{RUBY_VERSION}",
      "From" => "foo@bar.invalid",
      "Referer" => "http://www.ruby-lang.org/") {|f|
      # ...
    }

+5

You may need to pass "User-Agent" as a parameter to the open method. Some sites require a valid User-Agent; otherwise they simply do not respond, or they return a 404 Not Found error.

    doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html",
      "User-Agent" => "MyCrawlerName (http://mycrawler-url.com)"))

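If you only want to send the header when the plain request fails, something along these lines might work. This is a sketch under assumptions: `fetch_with_agent_fallback`, `DEFAULT_AGENT` and the `opener` parameter are names invented here, the retry-on-any-HTTP-error policy is just one possible choice, and `URI.open` requires Ruby 2.5 or later (use plain `open` on older versions):

```ruby
require 'open-uri'

# Hypothetical default; substitute a string that identifies your own client.
DEFAULT_AGENT = "MyCrawlerName (http://mycrawler-url.com)"

# Tries the plain request first and retries once with a User-Agent header
# if the server answers with an HTTP error. A second failure propagates.
# `opener` is injectable so the logic can be exercised without network access.
def fetch_with_agent_fallback(url, agent: DEFAULT_AGENT, opener: URI.method(:open))
  opener.call(url, &:read)
rescue OpenURI::HTTPError
  opener.call(url, 'User-Agent' => agent, &:read)
end
```

The returned string can be passed straight to `Nokogiri::HTML`.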
+5

So what is happening and how can I deal with such an error?

I don't know what is going on, but you can handle it by rescuing the error.

    begin
      doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
      puts doc
    rescue => e
      puts "I failed: #{e}"
    end

Is it possible to simply ignore it and let the rest do their job?

Sure! Maybe? Not sure. We don't know your requirements.

+2

Source: https://habr.com/ru/post/1201936/

