Mechanize html scraper

Question


I'm trying to extract the email addresses from my website using Ruby Mechanize and Hpricot. I loop over all the pages of my administration side and parse each page with Hpricot. So far so good. Then I get:

Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*

After it has parsed a bunch of pages, it starts timing out and then prints the HTML code of the page. I can't understand why. How can I debug this? It seems like Mechanize can't get more than 10 pages in a row. Is that possible? Thanks.

require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'

class Harvester

  def initialize(page)
    @page=page
    @agent = WWW::Mechanize.new{|a| a.log = Logger.new("logs.log") }
    @agent.keep_alive=false
    @agent.read_timeout=15
  end

  def login
    f = @agent.get( "http://****.com/admin/index.asp").forms.first
    f.set_fields(:username => "user", :password =>"pass")
    f.submit
  end

  def harvest(s)
    pageNumber=1
    #@agent.read_timeout =
    s.upto(@page) do |pagenb|
      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        #time=Time.now
        #search=@agent.get( "http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
        extract(pagenb)
      rescue => e
        puts  "unknown #{e.to_s}"
        #puts  "url:http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}"
        #sleep(2)
        extract(pagenb)
      rescue Net::HTTPBadResponse => e
        puts "net exception"+ e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: "+ex.response_code
      rescue Timeout::Error => e
        puts "timeout: "+e.to_s
      end
    end
  end

  def extract(page)
    #puts search.body
    search=@agent.get( "http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)

    #remove titles
    #~ doc.search("/html/body/div/table[2]/tr/td[2]/table[3]/tr[1]").remove

    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      #delete the phone number from the html
      temp = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index-1]
      puts  email
      f=File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end
end

" ..."

start = ARGV[0].to_i

h = Harvester.new(186)
h.login
h.harvest(start)
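A note on the error handling in harvest, since the question asks how to debug this: Ruby tries rescue clauses from top to bottom and runs the first one that matches. A bare `rescue => e` matches every StandardError, so specific clauses listed after it are not reached for StandardError subclasses. The sketch below shows this in plain Ruby. `BadResponse`, `generic_first` and `specific_first` are hypothetical names used only for illustration; `BadResponse` stands in for a specific error such as Net::HTTPBadResponse.

```ruby
# Hypothetical error class standing in for a specific failure
# such as Net::HTTPBadResponse (also a StandardError subclass).
class BadResponse < StandardError; end

# Generic clause first: it matches every StandardError, so the
# specific clause below it is never reached.
def generic_first
  raise BadResponse, "wrong status line"
rescue => e
  "generic: #{e.message}"
rescue BadResponse => e
  "specific: #{e.message}"
end

# Specific clause first: Ruby uses the first clause whose class
# matches the raised exception, so this one reports the real cause.
def specific_first
  raise BadResponse, "wrong status line"
rescue BadResponse => e
  "specific: #{e.message}"
rescue => e
  "generic: #{e.message}"
end

puts generic_first   # generic: wrong status line
puts specific_first  # specific: wrong status line
```

Listing the specific clauses first would likely make the log show which kind of failure each page hits, rather than "unknown" for all of them.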


ruby screen-scraping mechanize

fenec May 24, '09 at 6:09


1 answer

Fluffy · Accepted Answer · 2009-08-27T14:12:22+0000

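Mechanize puts the full content of each page into its history, which may cause problems when browsing through many pages. To limit the size of the history, try: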

@mech = WWW::Mechanize.new do |agent|
  agent.history.max_size = 1
end
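To illustrate what a capped history does, here is a minimal plain-Ruby sketch that needs no Mechanize. It assumes a history is just a list of page bodies that drops its oldest entries once it grows past `max_size`. `BoundedHistory` is a hypothetical class written for this example, not Mechanize's actual implementation.

```ruby
# Hypothetical stand-in for a page history; not Mechanize's real class.
class BoundedHistory
  attr_reader :pages

  def initialize(max_size)
    @max_size = max_size
    @pages = []
  end

  # Store a page body, then drop the oldest entries beyond max_size.
  def push(body)
    @pages << body
    @pages.shift while @pages.size > @max_size
    self
  end
end

history = BoundedHistory.new(1)
# 186 matches the page count passed to Harvester.new in the question.
1.upto(186) { |n| history.push("<html>page #{n}</html>") }

puts history.pages.size  # 1
puts history.pages.last  # <html>page 186</html>
```

With the cap at 1, only the most recent page stays in memory however many pages the loop visits, which is the effect the setting above aims for.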
