Web scraping http://www.ssa.gov/cgi-bin/popularnames.cgi with urllib

I am very new to Python (and web scraping), so let me ask a question.

Many websites do not expose page-specific URLs in Firefox or other browsers. For example, the Social Security Administration shows the most popular baby names by rank (from 1880 on), but the URL does not change when I change the year from 1880 to 1881. It stays constant at

http://www.ssa.gov/cgi-bin/popularnames.cgi

Since there is no year-specific URL, I could not load each page using urllib.

The page source includes:

<input type="text" name="year" id="yob" size="4" value="1880">

So if I could control this "year" value (setting it to, say, "1881" or "1991"), I could solve this problem, right? I just don't know how to do it.

Can someone suggest a solution, please?

If you know some websites that might help my research, please let me know.

THANKS!

+6
4 answers

You can use urllib . The button performs a POST to the current URL. Using Firebug in Firefox, I looked at the network traffic and found that it sends three parameters: member , top and year . You can send the same arguments:

    import urllib

    url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'
    post_params = {
        # 'member' was blank, so I'm excluding it.
        'top': '25',
        'year': year,
    }
    post_args = urllib.urlencode(post_params)

Now just send the URL-encoded arguments:

    urllib.urlopen(url, post_args)

If you also need to send headers, note that urllib.urlopen does not accept a headers argument; use urllib2.Request instead:

    import urllib2

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Host': 'www.ssa.gov',
        'Referer': 'http://www.ssa.gov/cgi-bin/popularnames.cgi',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    }

    # With POST data:
    request = urllib2.Request(url, post_args, headers)
    response = urllib2.urlopen(request)

Run the code in a loop:

    for year in xrange(1880, 2014):
        # The above code...
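The snippets above are Python 2 ( urllib.urlencode , xrange ). For reference, here is a minimal Python 3 sketch of the same idea, assuming the form still accepts the top and year parameters; the actual fetch is commented out since it needs network access:

```python
# Python 3: urllib was split into urllib.request and urllib.parse.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

URL = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'

def build_post_args(year, top=25):
    """URL-encode the form parameters as bytes, ready to use as a POST body."""
    return urlencode({'top': str(top), 'year': str(year)}).encode('ascii')

for year in range(1880, 2014):
    post_args = build_post_args(year)
    # html = urlopen(Request(URL, data=post_args)).read()  # needs network access
```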
+7

I recommend using Scrapy . It is a very powerful and easy-to-use web scraping framework. Why you should try it:

  • Speed / performance / efficiency

    Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it uses non-blocking (aka asynchronous) code for concurrency.

  • Database pipelining

    Scrapy has an Item Pipelines feature:

    After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

    Thus, each page can be written to the database immediately after it is downloaded.

  • Code Organization

    Scrapy gives you a clean, logical project structure, with separate places for settings, spiders, items, pipelines, etc. This alone makes your code simpler and clearer.

  • Development time

    Scrapy does a lot of work for you behind the scenes. It lets you focus on the actual code and logic, rather than on the plumbing: creating processes, threads, etc.
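To illustrate the Item Pipeline idea above, here is a hypothetical pipeline component in plain Python that writes each scraped item (a dict) straight into SQLite. A real Scrapy pipeline class uses the same process_item(self, item, spider) hook; the table schema here is invented for the baby-names example:

```python
import sqlite3

class SqlitePipeline(object):
    """Hypothetical pipeline: stores each scraped item in a SQLite table."""

    def __init__(self, db_path=':memory:'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS names (year INTEGER, rank INTEGER, name TEXT)')

    def process_item(self, item, spider=None):
        # Called once per scraped item; insert it and pass it on.
        self.conn.execute('INSERT INTO names VALUES (?, ?, ?)',
                          (item['year'], item['rank'], item['name']))
        self.conn.commit()
        return item  # pipelines return the item so later components can see it
```

Usage would look like pipeline.process_item({'year': 1880, 'rank': 1, 'name': 'John'}) , with Scrapy invoking the hook for you once the pipeline is registered in the project settings.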

Yes, you got it - I like it.

To get started, see the Scrapy tutorial in the official documentation.

Hope this helps.

+3

I recommend using a tool like mechanize . It lets you programmatically navigate web pages from Python, and there are many guides on using it. Basically, with mechanize you do the same things you would do in a browser: fill in the text box, click the "Go" button, and parse the web page you get back in response.
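The "parse the web page" step can be sketched with the standard library alone. A hypothetical example, using html.parser to pull the value of the year input the question quotes (the real page markup may differ, and a full scraper would extract the name rankings the same way):

```python
from html.parser import HTMLParser

class YearInputFinder(HTMLParser):
    """Finds the value attribute of the <input> whose name is 'year'."""

    def __init__(self):
        super().__init__()
        self.year_value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name') == 'year':
            self.year_value = attrs.get('value')

html = '<input type="text" name="year" id="yob" size="4" value="1880">'
finder = YearInputFinder()
finder.feed(html)
# finder.year_value is now '1880'
```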

+2

I have used the mechanize / BeautifulSoup libraries for similar things before. If I were starting such a project now, I would also look at https://github.com/scrapy/scrapy

+2

Source: https://habr.com/ru/post/947773/
