I just finished my first Python script, a scraper for election results from the Philippines. I don't have a programming background; I have used Stata for statistical analysis and dabbled a bit in R, since I want to switch at some point. But I want to learn Python to retrieve data from websites and other sources. So far I have only worked through the Python tutorial; O'Reilly's "Learning Python" is waiting on my shelf. I wrote the following script, taking inspiration from other people's scripts and looking at the documentation of the included packages.
I am mainly looking for general advice. The script works, but are there any unnecessary parts? Should I structure it differently? Are there typical (or plain stupid) beginner mistakes?
I have also collected some questions, which are listed after the script.
    import mechanize
    import lxml.html
    import csv

    site = "http://www.comelec.gov.ph/results/2004natl/2004electionresults_local.aspx"
    br = mechanize.Browser()
    response = br.open(site)

    output = csv.writer(file(r'output.csv','wb'))

    br.select_form(name="ctl00")
    provinces = br.possible_items("provlist")

    for prov in provinces:
        br.select_form(name="ctl00")
        br["provlist"] = [prov]
        response = br.submit()
        br.select_form(name="ctl00")
        pname = str(br.get_value_by_label("provlist")).strip("[]")
        municipalities = br.possible_items("munlist")
        for mun in municipalities:
            br.select_form(name="ctl00")
            br["munlist"] = [mun]
            response = br.submit(type="submit", name="ctl01")
            html = response.read()
            root = lxml.html.fromstring(html)
            try:
                table = root.get_element_by_id(id="dlistCandidates")
                data = [
                    [td.text_content().strip() for td in row.findall("td")]
                    for row in table.findall('tr')
                ]
            except KeyError:
                print "Results not available yet."
                data = [ [ "." for i in range(5) ] ]
            br.select_form(name="ctl00")
            mname = str(br.get_value_by_label("munlist")).strip('[]')
            print pname, mname, data, "\n"
            for row in data:
                if row:
                    row.append(pname)
                    row.append(mname)
                    output.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
When I run the script, I get the warning "DeprecationWarning: [item.name for item in self.items]". What is the reason for it, and should I worry about it?
Right now I loop over the numeric keys of the provinces and then look up the name each time. Would it be better to build a dictionary once at the beginning and loop over that?
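To make the dictionary idea concrete, here is a minimal sketch of what I have in mind. It is written in Python 3 syntax (my script above is Python 2). `build_lookup` is a hypothetical helper and the keys and names in `sample_options` are made-up placeholders, not real values from the site; in the real script the pairs would have to be read from the form's `provlist` control.

```python
def build_lookup(options):
    """Map each numeric form key to its display name, built once up front."""
    return {key: name.strip() for key, name in options}


# Placeholder (key, name) pairs; the real ones would be read from the form.
sample_options = [("1", " Province A "), ("2", "Province B")]

province_names = build_lookup(sample_options)

for key in sorted(province_names):
    # The key is what gets submitted; the name goes into the output row.
    print(key, province_names[key])
```

This would replace the second `select_form` / `get_value_by_label` round trip inside the loop with a plain dictionary lookup.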
Is there an easy way to convert the character "eñe" (Ñ, an N with a tilde above) directly to a plain N?
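From what I can tell, the standard-library `unicodedata` module might be one way to do this. The following is only a minimal sketch under that assumption: `strip_accents` is a name I made up, and the place name is just an example input. It decomposes each character into a base letter plus combining marks and then drops the marks, so it would also remove other accents, not only the tilde.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks (tilde, acute, ...) removed."""
    # NFKD splits e.g. "ñ" into "n" followed by a combining tilde.
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() is non-zero for combining marks, so those are skipped.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_accents(u"Para\u00f1aque")  # u"\u00f1" is the letter ñ
print(plain)
```

The input has to be a unicode string, so in my script this would need to happen before the `encode('utf8')` step.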
Instead of writing the data on every iteration, how would I best collect everything first and then write the CSV file at the end? Would that be a better solution?
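As a rough sketch of the "collect first, write once" variant I am asking about (Python 3 syntax, unlike my script above): rows are appended to a list inside the loop and written with a single `writerows` call afterwards. `write_rows` is a hypothetical helper, and the candidate, province and municipality values are placeholders standing in for scraped data.

```python
import csv
import io


def write_rows(rows, handle):
    """Write all collected rows to an open text file handle in one go."""
    writer = csv.writer(handle)
    writer.writerows(rows)


all_rows = []  # filled inside the scraping loops instead of writing each row

# Placeholder data standing in for one scraped table.
scraped = [["Candidate X", "100"], ["Candidate Y", "50"]]
for row in scraped:
    all_rows.append(row + ["Province A", "Municipality B"])

# An in-memory buffer keeps the sketch self-contained; a real run would use
# open("output.csv", "w", newline="") instead.
buffer = io.StringIO()
write_rows(all_rows, buffer)
```

One trade-off I can see: if the script fails halfway through an hour-long run, nothing has been written yet, whereas writing row by row keeps whatever was scraped up to that point.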
The site takes a long time to respond to each request, and getting all the data takes about an hour. I can speed it up by running several instances of the script, each on a part of the list of provinces. How can I send concurrent requests from within one script? Ultimately I want to get more data from this site, so it would be nice to speed up the process.
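One approach I have come across is a thread pool from the standard-library `concurrent.futures` module in Python 3. The sketch below only shows the fan-out pattern: `scrape_province` is a hypothetical stand-in that returns a fake row instead of contacting the site, and the province keys are placeholders. I am assuming each worker would need its own browser object rather than sharing one.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_province(prov):
    """Hypothetical stand-in for scraping one province.

    The real version would open its own browser, submit the forms and
    return the table rows; here it returns a fake row so the sketch runs.
    """
    return [[prov, "placeholder result"]]


provinces = ["1", "2", "3"]  # placeholder keys

# Run up to three provinces at a time; map() yields results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(scrape_province, provinces))

# Flatten the per-province lists into one list of rows.
all_rows = [row for rows in results for row in rows]
```

Threads seem like a fit here because the time is spent waiting for the server, not computing, though a small `max_workers` is probably kinder to a site that is already slow.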
I tried both BeautifulSoup and the lxml module, but I liked the lxml solution better. What other modules are often useful for such tasks?
Is there a central place for documentation and help files for the built-in modules and for third-party ones? It seemed to me that the documentation is scattered all over the place, which is somewhat inconvenient. Typing help(something) often led to "something not found."
Any recommendations and criticism are welcome. English is not my native language, but I hope I have managed to keep the errors to a minimum.