Web scraping http://www.ssa.gov/cgi-bin/popularnames.cgi with urllib

I am very new to Python (and web scraping), so let me ask a question.

Many websites do not expose page-specific URLs in Firefox or other browsers. For example, the Social Security Administration shows the most popular baby names by rank (from 1880 on), but the URL does not change when I change the year from 1880 to 1881. It stays constant at

http://www.ssa.gov/cgi-bin/popularnames.cgi

Since there is no year-specific URL, I could not load each page using urllib.

The page source includes:

<input type="text" name="year" id="yob" size="4" value="1880">

So if I could control this "year" value (setting it to, say, "1881" or "1991"), I could solve this problem, right? I just don't know how to do it.

Can someone suggest a solution, please?

If you know some websites that might help my research, please let me know.

THANKS!

+6
4 answers

You can use urllib . The button performs a POST to the current URL. Using Firebug in Firefox, I looked at the network traffic and found that it sends three parameters: member , top and year . You can send the same arguments:

    import urllib

    url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'
    post_params = {
        # 'member' was blank, so I'm excluding it.
        'top': '25',
        'year': year,
    }
    post_args = urllib.urlencode(post_params)

Now just send the URL-encoded arguments:

    urllib.urlopen(url, post_args)

If you also need to send headers, note that urllib.urlopen does not accept a headers argument; use urllib2.Request instead:

    import urllib2

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Host': 'www.ssa.gov',
        'Referer': 'http://www.ssa.gov/cgi-bin/popularnames.cgi',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    }

    # With POST data:
    request = urllib2.Request(url, post_args, headers)
    response = urllib2.urlopen(request)

Run the code in a loop:

    for year in xrange(1880, 2014):
        # The above code...
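The snippets above are Python 2 ( urllib.urlencode , xrange ). For reference, here is a minimal Python 3 sketch of the same idea, assuming the form still accepts the top and year parameters; the actual fetch is commented out since it needs network access:

```python
# Python 3: urllib was split into urllib.request and urllib.parse.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

URL = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'

def build_post_args(year, top=25):
    """URL-encode the form parameters as bytes, ready to use as a POST body."""
    return urlencode({'top': str(top), 'year': str(year)}).encode('ascii')

for year in range(1880, 2014):
    post_args = build_post_args(year)
    # html = urlopen(Request(URL, data=post_args)).read()  # needs network access
```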
+7

I recommend using Scrapy . It is a very powerful and easy-to-use web scraping framework. Why you should try it:

  • Speed / performance / efficiency

    Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it uses non-blocking (aka asynchronous) code for concurrency.

  • Database pipelining

    Scrapy has an Item Pipelines feature:

    After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

    Thus, each page can be written to the database immediately after it is downloaded.

  • Code Organization

    Scrapy gives you a clean, logical project structure, with separate places for settings, spiders, items, pipelines, etc. This alone makes your code simpler and clearer.

  • Development time

    Scrapy does a lot of work for you behind the scenes. It lets you focus on the actual code and logic, rather than on the plumbing: creating processes, threads, etc.
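To illustrate the Item Pipeline idea above, here is a hypothetical pipeline component in plain Python that writes each scraped item (a dict) straight into SQLite. A real Scrapy pipeline class uses the same process_item(self, item, spider) hook; the table schema here is invented for the baby-names example:

```python
import sqlite3

class SqlitePipeline(object):
    """Hypothetical pipeline: stores each scraped item in a SQLite table."""

    def __init__(self, db_path=':memory:'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS names (year INTEGER, rank INTEGER, name TEXT)')

    def process_item(self, item, spider=None):
        # Called once per scraped item; insert it and pass it on.
        self.conn.execute('INSERT INTO names VALUES (?, ?, ?)',
                          (item['year'], item['rank'], item['name']))
        self.conn.commit()
        return item  # pipelines return the item so later components can see it
```

Usage would look like pipeline.process_item({'year': 1880, 'rank': 1, 'name': 'John'}) , with Scrapy invoking the hook for you once the pipeline is registered in the project settings.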

Yes, you got it - I like it.

To get started, see the Scrapy tutorial in the official documentation.

Hope this helps.

+3

I recommend using a tool like mechanize . It lets you programmatically navigate web pages from Python, and there are many guides on using it. Basically, with mechanize you do the same things you would do in a browser: fill in the text box, click the "Go" button, and parse the web page you get back in response.
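The "parse the web page" step can be sketched with the standard library alone. A hypothetical example, using html.parser to pull the value of the year input the question quotes (the real page markup may differ, and a full scraper would extract the name rankings the same way):

```python
from html.parser import HTMLParser

class YearInputFinder(HTMLParser):
    """Finds the value attribute of the <input> whose name is 'year'."""

    def __init__(self):
        super().__init__()
        self.year_value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name') == 'year':
            self.year_value = attrs.get('value')

html = '<input type="text" name="year" id="yob" size="4" value="1880">'
finder = YearInputFinder()
finder.feed(html)
# finder.year_value is now '1880'
```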

+2

I have used the mechanize / BeautifulSoup libraries for similar things before. If I were starting such a project now, I would also look at https://github.com/scrapy/scrapy

+2

Source: https://habr.com/ru/post/947773/
