First Python script, a scraper: recommendations welcome

I just finished my first Python script, a scraper for election results from the Philippines. I don't have a programming background: I have used Stata for statistical analysis and cursed a bit at R, but I want to switch at some point, and I want to learn Python to retrieve data from websites and other sources. So far I have only worked through the Python tutorial; O'Reilly's “Learning Python” is waiting on my shelf. I wrote the following script, taking inspiration from other people's scripts and from the documentation of the packages involved.

I am mainly looking for general advice. The script does work, but are there any superfluous parts? Should I structure it differently? Are there typical (or plain stupid) beginner mistakes in it?

I collected some questions that I listed after the script.

    import mechanize
    import lxml.html
    import csv

    site = "http://www.comelec.gov.ph/results/2004natl/2004electionresults_local.aspx"
    br = mechanize.Browser()
    response = br.open(site)
    output = csv.writer(file(r'output.csv', 'wb'))

    # The province list is a <select> control on the form "ctl00".
    br.select_form(name="ctl00")
    provinces = br.possible_items("provlist")

    for prov in provinces:
        # Select a province and submit to load its municipalities.
        br.select_form(name="ctl00")
        br["provlist"] = [prov]
        response = br.submit()

        br.select_form(name="ctl00")
        pname = str(br.get_value_by_label("provlist")).strip("[]")
        municipalities = br.possible_items("munlist")

        for mun in municipalities:
            # Select a municipality and submit to load its results page.
            br.select_form(name="ctl00")
            br["munlist"] = [mun]
            response = br.submit(type="submit", name="ctl01")

            html = response.read()
            root = lxml.html.fromstring(html)
            try:
                table = root.get_element_by_id(id="dlistCandidates")
                data = [[td.text_content().strip() for td in row.findall("td")]
                        for row in table.findall('tr')]
            except KeyError:
                print "Results not available yet."
                data = [["." for i in range(5)]]

            br.select_form(name="ctl00")
            mname = str(br.get_value_by_label("munlist")).strip('[]')
            print pname, mname, data, "\n"

            # Tag each row with its province and municipality, then write it.
            for row in data:
                if row:
                    row.append(pname)
                    row.append(mname)
                    output.writerow([s.encode('utf8') if type(s) is unicode else s
                                     for s in row])
  • When I execute the script, I get the warning "DeprecationWarning: [item.name for item in self.items]". Is this a reason to worry, and what should I do about it?

  • Right now I iterate over the numeric keys of the provinces and look up the corresponding name on every pass. Would it be better to build a dictionary at the beginning and loop over that?

  • Is there an easy way to convert the character ñ (n with a tilde above it) directly to a plain n?

  • I write out data on every iteration. Would it be better to collect everything and write the CSV file once at the end?

  • The site takes a long time to respond to each request, and getting all the data takes about an hour. I can speed things up by running several copies of the script in parallel, each on part of the list of provinces, but how can I send concurrent requests from within one script? Eventually I want to get more data from this site, so it would be nice to speed up the process.

  • I tried both BeautifulSoup and the lxml module, and I liked the lxml solution better. What other modules are often useful for tasks like this?

  • Is there a central index of documentation / help files for built-in and other modules? The docs seemed to be scattered all over the place, which is somewhat inconvenient, and typing help(something) often just produced "something not found."

Any recommendations and criticism are welcome. English is not my native language, but I hope I have at least kept the errors down.

1 answer
  • The DeprecationWarning comes from the mechanize module and is issued when possible_items is called. The warning text itself points at the preferred way to get the same effect; I don't know why the author didn't make this more explicit. See the sketch below.
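
A minimal sketch of the non-deprecated spelling (find_control() and the items attribute are part of mechanize's documented API; the URL and form names are taken from your script):

    import mechanize

    br = mechanize.Browser()
    br.open("http://www.comelec.gov.ph/results/2004natl/2004electionresults_local.aspx")
    br.select_form(name="ctl00")
    # The items attribute replaces the deprecated possible_items() helper.
    provinces = [item.name for item in br.form.find_control("provlist").items]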

  • I do not think it matters much either way.

  • You could look at http://effbot.org/zone/unicode-convert.htm; a sketch of the usual recipe is below.
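
One common recipe along those lines, sketched here with the standard unicodedata module: decompose accented characters, then throw away the combining marks. Test it on your own data, since it only handles characters that decompose this way.

    import unicodedata

    def asciify(text):
        # NFKD splits n-with-tilde into 'n' plus a combining tilde;
        # encoding to ASCII with 'ignore' then drops the tilde.
        return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

    print asciify(u'Dasmari\xf1as')   # -> 'Dasmarinas'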

  • Writing in stages, as you do, looks fine to me. Alternatively, you could build a list of rows, append to it in your loop, and then write it all out at once at the end; the main advantage would be a slight increase in modularity. (If you later wanted to do the same scraping but use the result differently, you could reuse the code more easily.) See the sketch below.
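
A sketch of that collect-then-write variant, using the csv module's writerows (the variable names here are illustrative):

    import csv

    all_rows = []     # accumulate rows instead of writing immediately

    # ... inside your municipality loop, instead of output.writerow(...):
    #     all_rows.append(row)

    # ... once, after both loops have finished:
    with open('output.csv', 'wb') as f:
        csv.writer(f).writerows(all_rows)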

  • (a) If the remote site takes a long time to respond, and all your requests go to that one site, are you sure that hitting it with several concurrent requests will actually help? (b) You might want to check that the site owners do not mind being scraped this way, partly out of politeness and partly because, if they do object, they may notice what you are doing and block you. I would guess that since this is a government site they are probably fine with it. (c) Take a look at the threading and multiprocessing modules in the Python standard library; a rough sketch follows.
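
One rough way to do it with a thread pool (multiprocessing.dummy is the thread-backed twin of the multiprocessing API; scrape_province is a hypothetical function you would fill in with your existing inner loop):

    from multiprocessing.dummy import Pool   # thread pool, multiprocessing API

    def scrape_province(prov):
        # Hypothetical: move the per-province work here, creating a fresh
        # mechanize.Browser() inside, since one Browser instance should
        # not be shared between threads. Return the rows for this province.
        pass

    pool = Pool(4)                            # e.g. four concurrent requests
    results = pool.map(scrape_province, provinces)
    pool.close()
    pool.join()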

  • I don't know, sorry.

  • No. (Unless you count Google.)

It looks like you are doing a certain amount of back and forth just to identify the provinces and municipalities. If they do not change between runs of the script, it might be worth saving them somewhere locally rather than requesting them from the remote site every time. (The gain probably is not worth the effort, but you could measure how long fetching this information actually takes.) A sketch follows.
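
A minimal caching sketch using pickle; the cache file name and the fetch_provinces helper are hypothetical stand-ins for your existing mechanize code:

    import os
    import pickle

    CACHE = 'provinces.pkl'

    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            provinces = pickle.load(f)
    else:
        provinces = fetch_provinces()         # your existing mechanize code
        with open(CACHE, 'wb') as f:
            pickle.dump(provinces, f)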

You might consider extracting the code that turns the HTML blob into a list of candidates (if that is what it is) into a separate function.
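
For instance, a sketch that is just your own lxml code wrapped up so that a missing results table comes back as None:

    import lxml.html

    def extract_candidates(html):
        root = lxml.html.fromstring(html)
        try:
            table = root.get_element_by_id("dlistCandidates")
        except KeyError:
            return None                      # results not available yet
        return [[td.text_content().strip() for td in row.findall("td")]
                for row in table.findall("tr")]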

You might also consider pulling something like this out into a separate function:

    def select_item(br, form, listname, value, submit_form=None):
        br.select_form(form)
        br[listname] = [value]
        return br.submit(type="submit", name=(submit_form or form))

and maybe something like this:

    def get_name(br, formname, label):
        br.select_form(formname)
        return str(br.get_value_by_label(label)).strip("[]")
