mechanize submit

I am trying to scrape http://www.nscb.gov.ph/ggi/database.asp, namely all the tables that you get from choosing municipalities / provinces. I use Python with lxml.html and mechanize. So far my scraper works fine, however I get HTTP Error 500: Internal Server Error when submitting the municipality [19] "Peñarrubia, Abra". I suspect this is due to character encoding. My guess is that the ñ character (n with a tilde above) causes this problem. How can I fix this?

Below is a working example of this part of my script. Since I'm just starting out working in python (and often use snippets that I find on SO), any further comments are much appreciated.

    from BeautifulSoup import BeautifulSoup
    import mechanize
    import lxml.html
    import csv


    class PrettifyHandler(mechanize.BaseHandler):
        def http_response(self, request, response):
            if not hasattr(response, "seek"):
                response = mechanize.response_seek_wrapper(response)
            # only use BeautifulSoup if response is html
            if response.info().dict.has_key('content-type') and ('html' in response.info().dict['content-type']):
                soup = BeautifulSoup(response.get_data())
                response.set_data(soup.prettify())
            return response


    site = "http://www.nscb.gov.ph/ggi/database.asp"

    output_mun = csv.writer(open(r'output-municipalities.csv', 'wb'))
    output_prov = csv.writer(open(r'output-provinces.csv', 'wb'))

    br = mechanize.Browser()
    br.add_handler(PrettifyHandler())

    # gets municipality stats
    response = br.open(site)
    br.select_form(name="form2")
    muns = br.find_control("strMunicipality2", type="select").items

    # municipality #19 is not working, those before do
    for pos, item in enumerate(muns[19:]):
        br.select_form(name="form2")
        br["strMunicipality2"] = [item.name]
        print pos, item.name
        response = br.submit(id="button2", type="submit")
        html = response.read()
        root = lxml.html.fromstring(html)
        table = root.xpath('//table')[1]
        data = [
            [td.text_content().strip() for td in row.findall("td")]
            for row in table.findall("tr")
        ]
        print data, "\n"
        for row in data[2:]:
            if row:
                row.append(item.name)
                output_mun.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
        response = br.open(site)  # go back button not working

    # provinces follow here

Many thanks!

edit: to be specific, an error occurs on this line

 response = br.submit(id="button2", type="submit") 
2 answers

OK, found it. It is BeautifulSoup that converts the page to Unicode, and prettify returns UTF-8 by default. You should use:

 response.set_data(soup.prettify(encoding='latin-1')) 
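To see why the encoding matters here, the sketch below compares the bytes that the municipality name produces under UTF-8 and under Latin-1. It assumes, as the fix above implies but the thread does not confirm, that the server expects Latin-1 form values; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The municipality name from the question; \xf1 is the code point of n with a tilde.
name = u"Pe\xf1arrubia, Abra"

# UTF-8 needs two bytes for the character, Latin-1 needs one.
utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("latin-1")

# What a reader that assumes Latin-1 would make of the UTF-8 bytes:
# the two-byte sequence turns into two unrelated characters.
misread = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(repr(misread))
```

If the form value goes out as the UTF-8 byte sequence, a server that decodes it as Latin-1 would look up a name that is not in its list, which may be what triggers the 500 response.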

A quick and dirty hack:

    def _pairs(self):
        return [(k, v.decode('utf-8').encode('latin-1'))
                for (i, k, v, c_i) in self._pairs_and_controls()]

    from mechanize import HTMLForm
    HTMLForm._pairs = _pairs

Or something less invasive (I do not think there is another way, because the Item class protects the name attribute):

 item.__dict__['name'] = item.name.decode('utf-8').encode('latin-1') 

before

 br["strMunicipality2"] = [item.name] 
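The re-encoding step used in both variants can be pulled into a small helper. This is only a sketch: `utf8_to_latin1` is a hypothetical name, not part of mechanize, and it assumes the value arrives as a UTF-8 byte string, as it does in the question's script. Values that are not valid UTF-8, or that contain characters Latin-1 cannot represent, are returned unchanged rather than raising.

```python
def utf8_to_latin1(raw):
    """Re-encode a UTF-8 byte string as Latin-1 (hypothetical helper).

    Returns the input unchanged when it is not valid UTF-8 or when it
    contains characters that Latin-1 cannot represent.
    """
    try:
        return raw.decode("utf-8").encode("latin-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return raw


# The name from the question, as UTF-8 bytes.
converted = utf8_to_latin1(b"Pe\xc3\xb1arrubia, Abra")
print(repr(converted))
```

With this helper, the line above would read `item.__dict__['name'] = utf8_to_latin1(item.name)`, and plain ASCII names pass through untouched.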

Source: https://habr.com/ru/post/892230/

