Mechanize html scraper

I'm trying to extract the email addresses from my site using Ruby Mechanize and Hpricot. I loop over all the pages of my admin section and parse each one with Hpricot. So far so good. Then I get:

Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*

After it has parsed a number of pages, it hits a timeout and then prints the HTML of the page as part of the error. I can't understand why. How can I debug this? It seems that Mechanize can't fetch more than 10 pages in a row. Is that possible? Thanks.



require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'

class Harvester

  def initialize(page)
    @page=page
    @agent = WWW::Mechanize.new{|a| a.log = Logger.new("logs.log") }
    @agent.keep_alive=false
    @agent.read_timeout=15
  end

  def login
    f = @agent.get("http://****.com/admin/index.asp").forms.first
    f.set_fields(:username => "user", :password =>"pass")
    f.submit
  end

  def harvest(s)
    pageNumber=1
    #@agent.read_timeout =
    s.upto(@page) do |pagenb|

      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        #time=Time.now
        #search=@agent.get( "http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
        extract(pagenb)

      rescue => e
        puts  "unknown #{e.to_s}"
        #puts  "url:http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}"
        #sleep(2)
        extract(pagenb)

      rescue Net::HTTPBadResponse => e
        puts "net exception"+ e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: "+ex.response_code
      rescue Timeout::Error => e
        puts "timeout: "+e.to_s
      end

    end
  end

  def extract(page)
    #puts search.body
    search=@agent.get("http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)

    #remove titles
    #~ doc.search("/html/body/div/table[2]/tr/td[2]/table[3]/tr[1]").remove

    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      #delete the phone number from the html
      temp = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index-1]
      puts  email
      f=File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end
end

" ..."

start = ARGV[0].to_i

h = Harvester.new(186)
h.login
h.harvest(start)
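
A side note on the harvest loop above: Ruby tries rescue clauses from top to bottom, and a bare `rescue => e` matches any StandardError. Net::HTTPBadResponse is a StandardError, so the clause written for it after the bare rescue is never reached, and the second call to extract inside the first handler runs without any protection. Below is a minimal sketch of one way to handle this, using only the standard library. The name `fetch_with_retries` and its `attempts` parameter are hypothetical, and the block stands in for the call to extract.

```ruby
require 'net/http'   # defines Net::HTTPBadResponse
require 'timeout'    # defines Timeout::Error

# Hypothetical helper: runs the block and retries it when an error that is
# probably transient is raised. Gives up and re-raises after `attempts` tries.
def fetch_with_retries(attempts = 3)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue Net::HTTPBadResponse, Timeout::Error => e
    if tries < attempts
      puts "transient error (#{e.class}), attempt #{tries} of #{attempts} failed"
      retry   # re-runs the begin block
    end
    raise     # out of attempts: re-raise the last error
  end
end

# The block fails twice, then succeeds on the third attempt.
calls = 0
result = fetch_with_retries(3) do |n|
  calls += 1
  raise Net::HTTPBadResponse, "wrong status line" if n < 3
  "page body"
end

# At the call site, specific handlers go before the generic one.
begin
  fetch_with_retries(2) { raise Timeout::Error, "execution expired" }
rescue Timeout::Error => e
  puts "timeout: #{e}"
rescue StandardError => e
  puts "unknown: #{e}"
end
```

In the loop from the question, the body of the block would be `extract(pagenb)`, so a failed page is retried a bounded number of times instead of once and unguarded.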

1 answer

Try limiting the size of Mechanize's page history, for example:

@mech = WWW::Mechanize.new do |agent|
  agent.history.max_size = 1
end
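
To illustrate what a history cap of this kind does: the setting above limits how many fetched pages the agent keeps around. Assuming the failures are related to the agent holding on to every large page it has visited, capping the history keeps that from growing with each request. The sketch below is a plain-Ruby stand-in, not Mechanize's own implementation; `BoundedHistory` is a hypothetical class used only to show the behaviour.

```ruby
# Illustration only: a plain-Ruby stand-in, not Mechanize's own history class.
class BoundedHistory
  attr_reader :pages

  def initialize(max_size = nil)
    @max_size = max_size   # nil means "keep everything"
    @pages = []
  end

  def push(page)
    @pages << page
    # Drop the oldest entries once the cap is exceeded.
    @pages.shift while @max_size && @pages.size > @max_size
    self
  end
end

unbounded = BoundedHistory.new
capped    = BoundedHistory.new(1)

# Simulate visiting 186 pages, as in the question.
186.times do |n|
  body = "page #{n}"
  unbounded.push(body)
  capped.push(body)
end

puts "unbounded history keeps #{unbounded.pages.size} pages"
puts "capped history keeps #{capped.pages.size} page(s)"
```

With a cap of 1 only the most recent page is retained, which is usually enough for a scraper that never navigates back.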

Source: https://habr.com/ru/post/1708970/

