I am trying to scrape http://www.nscb.gov.ph/ggi/database.asp, namely all the tables that you get from selecting municipalities / provinces. I am using Python with lxml.html and mechanize. So far my scraper works fine, but I get HTTP Error 500: Internal Server Error when submitting municipality [19], “Peñarrubia, Abra”. I suspect this is due to character encoding: my guess is that the ñ character (n with a tilde above) causes the problem. How can I fix this?
Below is a working example of this part of my script. Since I'm just starting out with Python (and often use snippets that I find on SO), any further comments are much appreciated.
from BeautifulSoup import BeautifulSoup
import mechanize
import lxml.html
import csv

class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if response is html
        if response.info().dict.has_key('content-type') and ('html' in response.info().dict['content-type']):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

site = "http://www.nscb.gov.ph/ggi/database.asp"

output_mun = csv.writer(open(r'output-municipalities.csv','wb'))
output_prov = csv.writer(open(r'output-provinces.csv','wb'))

br = mechanize.Browser()
br.add_handler(PrettifyHandler())

# gets municipality stats
response = br.open(site)
br.select_form(name="form2")
muns = br.find_control("strMunicipality2", type="select").items
# municipality #19 is not working, those before do
for pos, item in enumerate(muns[19:]):
    br.select_form(name="form2")
    br["strMunicipality2"] = [item.name]
    print pos, item.name
    response = br.submit(id="button2", type="submit")
    html = response.read()
    root = lxml.html.fromstring(html)
    table = root.xpath('//table')[1]
    data = [
        [td.text_content().strip() for td in row.findall("td")]
        for row in table.findall("tr")
    ]
    print data, "\n"
    for row in data[2:]:
        if row:
            row.append(item.name)
            output_mun.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
    response = br.open(site) #go back button not working

# provinces follow here
Many thanks!
Edit: to be specific, the error occurs on this line:
response = br.submit(id="button2", type="submit")
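To illustrate why I suspect the encoding: the small sketch below is not part of my scraper, and the variable names are made up for illustration. It only shows that the ñ in the municipality name is percent-encoded differently depending on the charset used. I don't know which encoding the server expects or which one mechanize actually sends, so this demonstrates that the two candidates differ, nothing more.

```python
# -*- coding: utf-8 -*-
# Standalone illustration; runs on Python 2 and Python 3.
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# The municipality name that triggers the HTTP 500 (\xf1 is the ñ character).
name = u"Pe\xf1arrubia, Abra"

# Percent-encode the name after converting it to bytes in two candidate charsets.
as_utf8 = quote(name.encode("utf-8"))
as_latin1 = quote(name.encode("latin-1"))

print(as_utf8)    # Pe%C3%B1arrubia%2C%20Abra
print(as_latin1)  # Pe%F1arrubia%2C%20Abra
```

If the server only accepts one of these byte sequences for the ñ, a submission that uses the other could plausibly be what triggers the error.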