Using the Beautiful Soup Python Module to replace tags with plain text

Question


I use Beautiful Soup to extract the “content” from web pages. I know that some people have asked this question before, and they were all pointed to Beautiful Soup, which is how I got started with it.

I managed to get most of the content, but I am running into problems with tags that are part of the content. (I am starting with a basic strategy: if a node contains more than x characters, it is content.) Take the HTML below as an example:

    <div id="abc">
        some long text goes <a href="/"> here </a> and hopefully it
        will get picked up by the parser as content
    </div>

    results = soup.findAll(text=lambda(x): len(x) > 20)

When I use the above code to get the long text, it breaks at the tags (the identified text starts with "and hopefully .."). So I tried replacing the tag with plain text as follows:

    anchors = soup.findAll('a')

    for a in anchors:
        a.replaceWith('plain text')

The above does not work, because Beautiful Soup inserts the string as a separate NavigableString, which causes the same problem when I use findAll with len(x) > 20. I could use regular expressions to first convert the HTML to plain text, strip all the unwanted tags, and then call Beautiful Soup. But I would like to avoid processing the same content twice. I am trying to parse these pages so that I can show a snippet of content for a given link (very much like Facebook Share), and I assume it will be faster if everything is done with Beautiful Soup.

So my question is: is there a way to “clear tags” and replace them with “plain text” using Beautiful Soup? If not, what would be the best way to do this?

Thanks for your suggestions!

Update: Alex's code worked very well for the sample example. I also tried various edge cases, and they all worked fine (with the modification below). So I tried it on a real-life website, and I ran into issues that puzzle me.

    import urllib
    from BeautifulSoup import BeautifulSoup

    page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
    soup = BeautifulSoup(page)  # missing from the posted snippet; soup is used below

    anchors = soup.findAll('a')
    i = 0
    for a in anchors:
        print str(i) + ":" + str(a)
        for a in anchors:
            if (a.string is None): a.string = ''
            if (a.previousSibling is None and a.nextSibling is None):
                a.previousSibling = a.string
            elif (a.previousSibling is None and a.nextSibling is not None):
                a.nextSibling.replaceWith(a.string + a.nextSibling)
            elif (a.previousSibling is not None and a.nextSibling is None):
                a.previousSibling.replaceWith(a.previousSibling + a.string)
            else:
                a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
                a.nextSibling.extract()
        i = i+1

When I run the above code, I get the following error:

    0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with Switched CES 2010 coverage</a>
    Traceback (most recent call last):
      File "parselink.py", line 44, in <module>
        a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
    TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'

When I look at the HTML code, the "Stay up to date..." link does not have a previous sibling. (I did not know how previousSibling worked until I saw Alex's code; based on my testing, it looks for the “text” before the tag.) So, if there is no previous sibling, I am surprised that it does not go through the if branch where a.previousSibling is None and a.nextSibling is None.

Could you tell me what I am doing wrong?
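The traceback says the left operand of `+` is a `Tag`, which suggests the anchor does have a previous sibling, just not a text one. A minimal sketch of how to check this, assuming the newer `bs4` package (where `previousSibling` is spelled `previous_sibling`) and a made-up HTML snippet rather than the real page:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the anchor is preceded by another tag, not by text.
html = '<p><b>Note:</b><a href="/">Stay up to date</a> with our coverage</p>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

prev_node = a.previous_sibling   # the <b> element
next_node = a.next_sibling       # the trailing text


def is_text(node):
    """True only for text nodes; False for tags and for None."""
    return isinstance(node, NavigableString)


print(type(prev_node).__name__)  # Tag
print(type(next_node).__name__)  # NavigableString
print(is_text(prev_node), is_text(next_node))
```

A sibling can therefore be `None`, a string, or another tag, so an `is None` test alone does not guarantee that concatenation is safe.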

-ecognium


python html-content-extraction

Ecognium Jan 14 '10 at 1:58


2 answers

When I tried to flatten tags in a document so that the entire content of each tag would be pulled up into its parent node (I wanted to reduce the content of a p tag with all its sub-paragraphs, lists, div and span elements, etc. inside, but get rid of the style and strong tags and some horrible remnants of a Word-to-HTML generator), I found it rather complicated to do with BeautifulSoup, since extract() also removes the contents and replaceWith() unfortunately does not accept None as an argument. After some experiments with recursion, I decided to use regular expressions either before or after processing the document with BeautifulSoup, in the following way:

    import re

    def flatten_tags(s, tags):
        pattern = re.compile(r"<(( )*|/?)(%s)(([^<>]*=\\\".*\\\")*|[^<>]*)/?>" % (isinstance(tags, basestring) and tags or "|".join(tags)))
        return pattern.sub("", s)

The tags argument is either a single tag name or a list of tag names that should be flattened.
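For readers on Python 3, where `basestring` no longer exists, the same idea can be sketched as follows. This is a simplified variant, not the answer's exact pattern: it adds a word boundary after the tag name so that flattening `b` does not also strip `<br>`, and it assumes attribute values never contain `<` or `>`.

```python
import re


def flatten_tags(html, tags):
    """Remove the opening and closing forms of the given tags, keeping their content."""
    if isinstance(tags, str):
        tags = [tags]
    names = "|".join(re.escape(t) for t in tags)
    # </? : opening or closing tag; \b : the name must end here (so "b" != "br");
    # [^<>]* : any attributes up to the closing bracket.
    pattern = re.compile(r"</?\s*(?:%s)\b[^<>]*>" % names, re.IGNORECASE)
    return pattern.sub("", html)


sample = '<p>some <b class="x">bold</b> and <a href="/">link</a><br> text</p>'
print(flatten_tags(sample, ["a", "b"]))
```

As with any regex over HTML, this is only reasonable for fairly clean markup; comments, scripts, or unusual attribute values can defeat it.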


aldi Jul 02 '10 at 17:05


Alex Martelli · Accepted Answer · 2010-01-14T02:49:02+0000

An approach that works for your specific example:

    from BeautifulSoup import BeautifulSoup

    ht = '''
    <div id="abc">
        some long text goes <a href="/"> here </a> and hopefully it
        will get picked up by the parser as content
    </div>
    '''
    soup = BeautifulSoup(ht)

    anchors = soup.findAll('a')
    for a in anchors:
        a.previousSibling.replaceWith(a.previousSibling + a.string)

    results = soup.findAll(text=lambda(x): len(x) > 20)

    print results

which emits

    $ python bs.py
    [u'\n some long text goes here ', u' and hopefully it \n will get picked up by the parser as content\n']

Of course, you will probably need to take a bit more care: if there is no a.string, or if a.previousSibling is None, you will need suitable if statements to handle such corner cases. But I hope this general idea can help you. (In fact, you may also want to merge the next sibling if it is a string. I am not sure how that plays with your len(x) > 20 heuristic, but say, for example, that you have two 9-character strings with an <a> containing a 5-character string in the middle; perhaps you would want to pick up the lot as a "23-character string"? I can't tell, because I don't understand the motivation for your heuristic.)

I imagine that besides <a> tags you will also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc.? I guess this, too, depends on what the actual idea behind your heuristic is!
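The corner cases mentioned here (no string in the anchor, no previous sibling, a sibling that is a tag) could be handled along these lines. This is only a sketch, written against the newer `bs4` package rather than the BeautifulSoup 3 import used above; the helper names `is_plain_text` and `inline_anchor_text` are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString


def is_plain_text(node):
    # Exact type check: skips None, tags, and subclasses such as Comment.
    return type(node) is NavigableString


def inline_anchor_text(soup):
    """Replace each <a> with its text, merged into adjacent text nodes."""
    for a in soup.find_all("a"):
        text = a.get_text()  # empty string for an empty anchor, never None
        prev_node, next_node = a.previous_sibling, a.next_sibling
        if is_plain_text(next_node):
            # Absorb the following text and drop the old node.
            text += str(next_node)
            next_node.extract()
        if is_plain_text(prev_node):
            # Fold everything into the preceding text node.
            prev_node.replace_with(str(prev_node) + text)
            a.extract()
        else:
            # No text before the anchor: the anchor itself becomes the text.
            a.replace_with(text)
    return soup


html = ('<div id="abc"> some long text goes <a href="/"> here </a> '
        'and hopefully it will get picked up by the parser as content </div>')
soup = inline_anchor_text(BeautifulSoup(html, "html.parser"))
results = soup.find_all(string=lambda s: len(s) > 20)
print(results)
```

With both neighbours merged, the whole sentence ends up in a single text node, so the length heuristic sees it as one candidate instead of two.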
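If several inline tags should be dropped at once, one possible approach in the newer `bs4` package is `unwrap()`, which removes a tag but keeps its children, followed by `smooth()` (available from Beautiful Soup 4.8.0) to merge the resulting adjacent strings. Neither call exists in the BeautifulSoup 3 module used in this thread, and the tag list below is just an assumed example, not a recommendation.

```python
from bs4 import BeautifulSoup

# Assumed set of inline tags to strip; adjust to the heuristic in use.
INLINE_TAGS = ["a", "b", "strong", "em", "i", "span"]


def flatten_inline(html, tags=INLINE_TAGS):
    """Parse html, unwrap the given tags, and merge neighbouring text nodes."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(tags):
        tag.unwrap()   # drop the tag itself, keep its contents in place
    soup.smooth()      # join consecutive strings into single text nodes
    return soup


html = ('<div>some <b>long</b> text goes <a href="/">here</a> and '
        '<strong>hopefully</strong> gets picked up</div>')
soup = flatten_inline(html)
strings = soup.find_all(string=True)
print(strings)
```

Block-level tags such as `<p>` or `<br>` are left out of the list on purpose, since removing them would glue separate paragraphs together; whether that is desirable depends on the heuristic.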
