I am using Beautiful Soup to extract “content” from web pages. I know this question has been asked before, and the answers all pointed to Beautiful Soup, which is how I got started with it.
I have managed to get most of the content, but I am running into problems with tags that are part of the content. (I am starting with a basic strategy: if a node contains more than x characters, then it is content.) Take the HTML below as an example:
    <div id="abc"> some long text goes <a href="/"> here </a> and hopefully it will get picked up by the parser as content </div>

    results = soup.findAll(text=lambda x: len(x) > 20)
When I use the code above to get the long text, it breaks at the tags (the identified text starts with "and hopefully .."). So I tried replacing the tag with plain text, as follows:
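To make the split concrete, here is a minimal sketch of what the parser sees for that snippet. It assumes the newer `bs4` package (Beautiful Soup 4) with Python's built-in `html.parser`, not the BeautifulSoup 3 module I am actually using, so `find_all(string=True)` stands in for `findAll(text=...)`; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module

html = ('<div id="abc"> some long text goes <a href="/"> here </a> '
        'and hopefully it will get picked up by the parser as content </div>')
soup = BeautifulSoup(html, "html.parser")

# Every run of text between two tags becomes its own string node.
strings = soup.find_all(string=True)

# A length filter is applied to each string node on its own,
# so the sentence is never measured as one piece.
long_strings = [s for s in strings if len(s) > 20]

for s in strings:
    print(repr(str(s)), len(s))
```

The text before the link, the link text, and the text after the link come back as three separate strings, which is why no single match contains the whole sentence.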
    anchors = soup.findAll('a')
    for a in anchors:
        a.replaceWith('plain text')
The above does not work, because Beautiful Soup inserts the string as a NavigableString, which causes the same problem when I use findAll with len(x) > 20. I could use regular expressions to strip all the unnecessary tags from the HTML first and then call Beautiful Soup. But I would like to avoid processing the same content twice. I am trying to parse these pages so that I can show a snippet of content for a given link (very much like Facebook Share), and if everything is done with Beautiful Soup, I presume it will be faster.
So my question is: is there a way to “clear tags” and replace them with “plain text” using Beautiful Soup? If not, what would be the best way to do this?
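For reference, the regular-expression pre-pass I would like to avoid could look roughly like the sketch below. It uses only the standard `re` module; the pattern and the helper name `strip_anchor_tags` are my own illustration, and a regex like this is fragile on real-world HTML (for example, an attribute value containing `>` would confuse it).

```python
import re

# Matches opening and closing <a> tags only; the link text between them is kept.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

def strip_anchor_tags(html):
    """Remove <a ...> and </a> markers, leaving the enclosed text in place."""
    return ANCHOR_TAG.sub("", html)

html = ('<div id="abc"> some long text goes <a href="/"> here </a> '
        'and hopefully it will get picked up by the parser as content </div>')
cleaned = strip_anchor_tags(html)
print(cleaned)
```

The cleaned string would then still have to be handed to Beautiful Soup, which is the second pass over the same content that I am hoping to skip.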
Thanks for your suggestions!
Update: Alex's code worked very well for the sample example. I also tried various edge cases, and they all worked fine (with the modification below). So I gave it a shot on a real-life website, and I ran into issues that puzzle me.
    import urllib
    from BeautifulSoup import BeautifulSoup

    page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
    soup = BeautifulSoup(page)

    anchors = soup.findAll('a')
    i = 0
    for a in anchors:
        print str(i) + ":" + str(a)
        for a in anchors:
            if (a.string is None): a.string = ''
            if (a.previousSibling is None and a.nextSibling is None):
                a.previousSibling = a.string
            elif (a.previousSibling is None and a.nextSibling is not None):
                a.nextSibling.replaceWith(a.string + a.nextSibling)
            elif (a.previousSibling is not None and a.nextSibling is None):
                a.previousSibling.replaceWith(a.previousSibling + a.string)
            else:
                a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
                a.nextSibling.extract()
        i = i+1
When I run the above code, I get the following error:
    0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with Switched CES 2010 coverage</a>
    Traceback (most recent call last):
      File "parselink.py", line 44, in <module>
        a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
    TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'
When I look at the HTML, the "Stay up to date.." link does not have a previous sibling (I did not know how previousSibling worked until I saw Alex's code and, based on my testing, it looks like it looks for the “text” before the tag). So, if there is no previous sibling, I am surprised that it does not go through the branch where a.previousSibling is None and a.nextSibling is None.
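One way to check my assumption about the siblings would be a small diagnostic like the sketch below, which reports what kind of node sits on each side of every link. It assumes the newer `bs4` package (Beautiful Soup 4), where the attributes are spelled `previous_sibling` and `next_sibling`; the markup and the `describe` helper are made up for illustration and are not taken from the Engadget page.

```python
from bs4 import BeautifulSoup, NavigableString, Tag  # assumption: Beautiful Soup 4

# Hypothetical markup: the second link directly follows another tag.
html = '<p>intro <a href="/one">one</a><a href="/two">two</a> outro</p>'
soup = BeautifulSoup(html, "html.parser")

def describe(node):
    """Return a short label for a sibling: 'None', 'text', or 'tag'."""
    if node is None:
        return "None"
    if isinstance(node, NavigableString):
        return "text"
    if isinstance(node, Tag):
        return "tag"
    return type(node).__name__

# Collect (href, kind of previous sibling, kind of next sibling) for each link.
report = []
for a in soup.find_all("a"):
    report.append((a["href"],
                   describe(a.previous_sibling),
                   describe(a.next_sibling)))

for href, prev_kind, next_kind in report:
    print(href, "previous:", prev_kind, "next:", next_kind)
```

If either side of a link reports "tag" rather than "text", joining it to a string with `+` would presumably fail in the way the traceback shows.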
Could you tell me what I am doing wrong?
-ecognium