I'm trying to extract the emails from my site using Ruby Mechanize and Hpricot. The idea is to loop over every page of my admin section and parse each page with Hpricot. So far so good. Then I get:
Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*
After it has parsed a bunch of pages, it starts timing out and then prints the HTML code of the page. I can't understand why. How can I debug this? It seems that Mechanize can't fetch more than about 10 pages in a row. Is that possible? Thanks.
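For reference, here is a stripped-down single-request version (the page number below is just an example, and the login step from the class further down would have to run first) that should dump the raw response behind the "wrong status line" error through the Mechanize logger:

require 'rubygems'
require 'mechanize'
require 'logger'

# minimal sketch: fetch one admin page with DEBUG logging to see the raw
# response that triggers Net::HTTPBadResponse (page number is illustrative)
agent = WWW::Mechanize.new do |a|
  a.log = Logger.new($stderr)
  a.log.level = Logger::DEBUG
  a.keep_alive = false          # fresh connection for every request
  a.read_timeout = 15
end

begin
  page = agent.get("http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=12")
  puts page.code
rescue Net::HTTPBadResponse => e
  puts "bad response: #{e.message}"   # message carries the "wrong status line: ..." text
end

And here is the full script: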
require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'
class Harvester
  def initialize(page)
    @page  = page
    @agent = WWW::Mechanize.new { |a| a.log = Logger.new("logs.log") }
    @agent.keep_alive   = false
    @agent.read_timeout = 15
  end

  def login
    f = @agent.get("http://****.com/admin/index.asp").forms.first
    f.set_fields(:username => "user", :password => "pass")
    f.submit
  end
  def harvest(s)
    pageNumber = 1
    s.upto(@page) do |pagenb|
      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        extract(pagenb)
      rescue Net::HTTPBadResponse => e
        puts "net exception: " + e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: " + ex.response_code
      rescue Timeout::Error => e
        puts "timeout: " + e.to_s
      rescue => e
        # generic rescue must come last, otherwise the specific ones above are never reached
        puts "unknown #{e.to_s}"
        extract(pagenb) # retry the page once
      end
    end
  end
  def extract(page)
    # puts search.body
    search = @agent.get("http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)
    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      temp  = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index - 1] # keep everything before the first tag
      puts email
      f = File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end
end
start = ARGV[0].to_i
h = Harvester.new(186)
h.login
h.harvest(start)