Python: Google search scraper with BeautifulSoup

Question

Purpose: pass a search term to Google search and scrape each result's URL, title, and the short description that is displayed along with the title.

I have the following code, which at the moment returns only the first 10 results, the default Google limit for one page. I'm not sure how to handle pagination when web scraping. Also, there is a discrepancy between the results shown on the actual page and what my code prints. I'm also not sure of the best way to parse span elements.

So far, I have a span like the following, and I want to remove the <em> element and concatenate the rest of the string. What would be the best way to do this?

 <span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>

The code:

 from BeautifulSoup import BeautifulSoup
 import urllib, urllib2

 def google_scrape(query):
     # Build the search URL; the query is URL-encoded
     address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
     # Send a browser-like User-Agent header with the request
     request = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
     urlfile = urllib2.urlopen(request)
     page = urlfile.read()
     soup = BeautifulSoup(page)

     linkdictionary = {}

     # Each search result sits in an <li class="g"> element
     for li in soup.findAll('li', attrs={'class':'g'}):
         sLink = li.find('a')
         print sLink['href']
         sSpan = li.find('span', attrs={'class':'st'})
         print sSpan

     return linkdictionary

 if __name__ == '__main__':
     links = google_scrape('beautifulsoup')

My output is as follows:

 http://www.crummy.com/software/BeautifulSoup/
 <span class="st"><em>Beautiful Soup</em>: a library designed for screen-scraping HTML and XML.<br /></span>
 http://pypi.python.org/pypi/BeautifulSoup/3.2.1
 <span class="st"><span class="f">Feb 16, 2012 &ndash; </span>HTML/XML parser for quick-turnaround applications like screen-scraping.<br /></span>
 http://www.beautifulsouptheatercollective.org/
 <span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
 http://lxml.de/elementsoup.html
 <span class="st"><em>BeautifulSoup</em> is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. <em>BeautifulSoup</em> uses a different parsing <b>...</b><br /></span>
 https://launchpad.net/beautifulsoup/
 <span class="st">The discussion group is at: http://groups.google.com/group/<em>beautifulsoup</em> &middot; Home page <b>...</b> <em>Beautiful Soup</em> 4.0 series is the current focus of development <b>...</b><br /></span>
 http://www.poetry-online.org/carroll_beautiful_soup.htm
 <span class="st"><em>Beautiful Soup BEAUTIFUL Soup</em>, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, <em>beautiful Soup</em>!<br /></span>
 http://www.youtube.com/watch?v=hDG73IAO5M8
 <span class="st"><span class="f">Jul 6, 2009 &ndash; </span>taken from the motion picture &quot;Alice in wonderland&quot; (1999) http://www.imdb.com/<wbr>title/tt0164993/<br /></wbr></span>
 http://www.soupsong.com/
 <span class="st">A witty and substantive research effort on the history of soup and food in all cultures, with over 400 pages of recipes, quotations, stories, traditions, literary <b>...</b><br /></span>
 http://www.facebook.com/beautifulsouptc
 <span class="st">To connect with The <em>Beautiful Soup</em> Theater Collective, sign up for Facebook <b>...</b> We&#39;re thrilled to announce the cast of <em>Beautiful Soup&#39;s</em> upcoming production of <b>...</b><br /></span>
 http://blog.dispatched.ch/webscraping-with-python-and-beautifulsoup/
 <span class="st"><span class="f">Mar 15, 2009 &ndash; </span>Recently my life has been a hype; partly due to my upcoming Python addiction. There&#39;s simply no way around it; so I should better confess it in <b>...</b><br /></span>

The results of a Google search page are structured as follows:

 <li class="g">
 <div class="vsc" sig="bl_" bved="0CAkQkQo" pved="0CAgQkgowBQ">
 <h3 class="r">
 <div class="vspib" aria-label="Result details" role="button" tabindex="0">
 <div class="s">
 <div class="f kv">
 <div id="poS5" class="esc slp" style="display:none">
 <div class="f slp">3 answers&nbsp;-&nbsp;Jan 16, 2009</div>
 <span class="st">
 I read this without finding the solution: <b>...</b> The "normal" way is to: Go to the <em>Beautiful Soup</em> web site, <b>...</b> Brian beat me too it, but since I already have <b>...</b>
 <br>
 </span>
 </div>
 <div> </div>
 <h3 id="tbpr_6" class="tbpr" style="display:none">
 </li>

Each search result is contained in an <li> element.
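On the pagination point, one possible approach is to request several pages by stepping the `start` parameter that already appears in the URL above. The sketch below only builds the URLs and does not fetch anything. It is written for Python 3, and it assumes that `start` is a zero-based result offset and that `num` is the page size, which is how the question's URL reads but is not confirmed here. The function name `build_search_urls` and its defaults are hypothetical.

```python
from urllib.parse import urlencode


def build_search_urls(query, pages=3, per_page=10):
    """Return one search URL per result page by stepping the `start` offset.

    Assumption: `start` is a zero-based offset of the first result on a page.
    """
    base = "http://www.google.com/search"
    urls = []
    for page in range(pages):
        params = {
            "q": query,
            "num": per_page,
            "hl": "en",
            "start": page * per_page,  # assumed offset of the first result
        }
        # urlencode applies quote_plus to each value, like urllib.quote_plus above
        urls.append(base + "?" + urlencode(params))
    return urls


for url in build_search_urls("beautiful soup"):
    print(url)
```

Each returned URL could then be passed through the same request and parsing steps as the single page in the question.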

python web-scraping screen-scraping urllib beautifulsoup

Null-hypothesis Jul 16 '12 at 10:39

2 answers

Chrisguest · Answer 1 · 2012-07-17T05:14:03+0000

This list comprehension will strip the <em> tags while keeping their text.

 >>> sSpan
 <span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
 >>> [em.replaceWithChildren() for em in sSpan.findAll('em')]
 [None]
 >>> sSpan
 <span class="st">The Beautiful Soup Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
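The session above uses the BeautifulSoup 3 API. For readers on Beautiful Soup 4 (the `bs4` package), a rough equivalent might look like the sketch below, where `unwrap()` takes the place of `replaceWithChildren()` and `get_text()` returns the concatenated text with all markup dropped. The `html` string is a shortened copy of the snippet from the question, and the `html.parser` backend is an arbitrary choice.

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet from the question
html = ('<span class="st">The <em>Beautiful Soup</em> Theater Collective was '
        'founded in the summer of 2010 <b>...</b><br /></span>')

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", attrs={"class": "st"})

# Option 1: remove only the <em> wrappers, leaving their text in place
for em in span.find_all("em"):
    em.unwrap()
print(span)

# Option 2: ignore the remaining markup and take the concatenated text
text = span.get_text()
print(text)
```

A plain loop is used instead of a list comprehension because the return values of the calls are not needed.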

Null-hypothesis · Answer 2 · 2012-07-17T17:59:57+0000

I built a simple regex that matches HTML tags, and then called replace on the cleaned string to remove the periods.

 import re
 p = re.compile(r'<.*?>')
 print p.sub('',str(sSpan)).replace('.','')

Before

 <span class="st">The <em>Beautiful Soup</em> is a collection of all the pretty places you would rather be. All posts are credited via a click through link. For further inspiration of pretty things, <b>...</b><br /></span>

After

 The Beautiful Soup is a collection of all the pretty places you would rather be All posts are credited via a click through link For further inspiration of pretty things,
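As the "After" output shows, `replace('.','')` removes every period, including the ones that end sentences. A variant that drops only the `...` markers and also decodes entities such as `&ndash;` could look like the following Python 3 sketch. The helper name `snippet_text` is hypothetical, and regex-based tag stripping is assumed to be good enough for these short snippets; it is not a general HTML parser.

```python
import html
import re

# Matches any single tag such as <em>, </span> or <br />
TAG_RE = re.compile(r"<[^>]+>")


def snippet_text(markup):
    """Strip tags, decode entities, and drop '...' markers; keep other periods."""
    text = TAG_RE.sub("", markup)
    text = html.unescape(text)       # e.g. &ndash; becomes an en dash
    text = text.replace("...", "")   # remove only the truncation markers
    return " ".join(text.split())    # collapse runs of whitespace


sample = ('<span class="st">The <em>Beautiful Soup</em> is a collection of all '
          'the pretty places you would rather be. All posts are credited via a '
          'click through link. For further inspiration of pretty things, '
          '<b>...</b><br /></span>')
print(snippet_text(sample))
```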
