Use BeautifulSoup to extract text before the first child tag

Question

From this HTML source:

<div class="category_link"> Category: <a href="/category/personal">Personal</a> </div>

I want to extract the text Category:

Here are my attempts using Python / BeautifulSoup (output shown as a comment after each #):

    parsed = BeautifulSoup(sample_html)
    parsed_div = parsed.findAll('div')[0]
    parsed_div.firstText()     # <a href="/category/personal">Personal</a>
    parsed_div.first()         # <a href="/category/personal">Personal</a>
    parsed_div.findAll()[0]    # <a href="/category/personal">Personal</a>

I expect the text node to be available as the first child. Any suggestions on how I can solve this?

python beautifulsoup

Elvis D'Souza Apr 14 '12 at 14:08

1 answer

Shrikant sharat · Accepted Answer · 2012-04-14T14:53:28+0000

I'm fairly sure the following should do what you want:

 parsed.find('a').previousSibling # or something like that

This will return a NavigableString instance, which behaves much like a unicode instance; you can call unicode() on it to get a plain unicode object.
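As a rough illustration of that conversion, here is a minimal sketch. It assumes the newer bs4 package on Python 3 rather than the BeautifulSoup 3 module used in this answer, so the attribute is spelled previous_sibling and the conversion is str() instead of unicode(). The "html.parser" argument and the variable names are choices made for this example only.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumption: bs4 on Python 3, with the parser from the standard library.
soup = BeautifulSoup(
    '<div class="a">Category: <a href="/">a link</a></div>', "html.parser"
)

# The text node that sits immediately before the first <a> tag.
node = soup.find("a").previous_sibling

# It is a NavigableString, which still knows its position in the tree.
is_navigable = isinstance(node, NavigableString)

# str() gives an ordinary string (the Python 3 counterpart of unicode()).
plain = str(node)

# Trim the surrounding whitespace to get just the label.
label = plain.strip()
print(repr(label))
```

Converting to a plain string is mainly useful when the text will be kept around after the parse tree is no longer needed.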

I will see if I can verify this and let you know.

EDIT: I just confirmed that it works:

    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup('<div class=a>Category: <a href="/">a link</a></div>')
    >>> soup.find('a')
    <a href="/">a link</a>
    >>> soup.find('a').previousSibling
    u'Category: '
    >>>
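For comparison, the same idea applied to the HTML from the question might look like the sketch below. It is not part of the session above: it assumes the bs4 package on Python 3 with the built-in "html.parser", and it also reads the text node through the div's contents list, which is what the question expected to find as the first child.

```python
from bs4 import BeautifulSoup

# The markup from the question, split across two lines for readability.
sample_html = (
    '<div class="category_link"> Category: '
    '<a href="/category/personal">Personal</a> </div>'
)

# Assumption: bs4 with the standard-library parser.
soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="category_link")

# contents lists every child node, text nodes included,
# so the text before the link is the first entry.
first_child = div.contents[0]

# The same node reached from the other side, as in the answer.
before_link = div.find("a").previous_sibling

label = first_child.strip()
print(repr(label))
```

Both routes reach the same text node here; going through the link's previous sibling is the safer choice when the div might start with some other element.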
