Use BeautifulSoup to extract text before the first child tag
Given this HTML source:
<div class="category_link"> Category: <a href="/category/personal">Personal</a> </div>
I want to extract the text Category:
Here are my attempts using Python / BeautifulSoup, with the output shown as a comment after each #:
parsed = BeautifulSoup(sample_html)
parsed_div = parsed.findAll('div')[0]
parsed_div.firstText()    # <a href="/category/personal">Personal</a>
parsed_div.first()        # <a href="/category/personal">Personal</a>
parsed_div.findAll()[0]   # <a href="/category/personal">Personal</a>
I expected the text node to be available as the first child. Any suggestions on how I can get at it?
I'm sure the following should do what you want:
parsed.find('a').previousSibling # or something like that
This returns a NavigableString instance, which behaves much like a unicode instance. You can call unicode() on it to get a plain unicode object.
I will see if I can verify this and let you know.
EDIT: I just confirmed that it works:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<div class=a>Category: <a href="/">a link</a></div>')
>>> soup.find('a')
<a href="/">a link</a>
>>> soup.find('a').previousSibling
u'Category: '
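
For anyone on a newer setup, here is a rough sketch of the same idea. It assumes the bs4 package (BeautifulSoup 4) on Python 3 rather than the BeautifulSoup 3 import used above; there the attribute is spelled previous_sibling, and str() takes the place of unicode(). The helper name text_before_first_tag is made up for illustration and is not part of the library.

```python
from bs4 import BeautifulSoup, NavigableString

# The markup from the question, split across two literals for readability.
SAMPLE_HTML = (
    '<div class="category_link"> Category: '
    '<a href="/category/personal">Personal</a> </div>'
)


def text_before_first_tag(html):
    """Hypothetical helper: stripped text preceding the first tag inside the first <div>."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div")

    # find(True) matches any tag, so this is the first tag nested in the div.
    first_tag = div.find(True)
    if first_tag is None:
        # No nested tags at all: the whole div is text.
        return div.get_text().strip()

    # The node just before that tag is the text node, if there is one.
    node = first_tag.previous_sibling
    if isinstance(node, NavigableString):
        # str() turns the NavigableString into a plain string.
        return str(node).strip()
    return ""


print(text_before_first_tag(SAMPLE_HTML))  # Category:
```

The strip() call is there because the text node in the question's markup carries surrounding whitespace. If the div starts directly with a tag, there is no preceding text node and the sketch returns an empty string.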