Using BeautifulSoup to extract text before the first child tag

From this html source:

<div class="category_link"> Category: <a href="/category/personal">Personal</a> </div> 

I want to extract the text Category:

Here are my attempts using Python / BeautifulSoup (the output is shown as a comment after each #):

    parsed = BeautifulSoup(sample_html)
    parsed_div = parsed.findAll('div')[0]
    parsed_div.firstText()    # <a href="/category/personal">Personal</a>
    parsed_div.first()        # <a href="/category/personal">Personal</a>
    parsed_div.findAll()[0]   # <a href="/category/personal">Personal</a>

I expect the text node to be available as the first child. Any suggestions on how I can solve this?

1 answer

I'm sure the following should do what you want:

 parsed.find('a').previousSibling # or something like that 

This returns a NavigableString instance, which behaves much like a unicode instance; you can call unicode() on it to get a plain unicode object.

I will see if I can verify this and let you know.

EDIT : I just confirmed that it works:

    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup('<div class=a>Category: <a href="/">a link</a></div>')
    >>> soup.find('a')
    <a href="/">a link</a>
    >>> soup.find('a').previousSibling
    u'Category: '
    >>>
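
For anyone on a newer install, here is a rough sketch of the same idea with the bs4 package (beautifulsoup4). That package is an assumption on my part, since the session above uses the older BeautifulSoup 3 import; in bs4 the attribute is spelled previous_sibling. The sample_html string is the markup from the question, and the name label is just illustrative.

```python
# Assumption: beautifulsoup4 (bs4) is installed; the session above uses BeautifulSoup 3.
from bs4 import BeautifulSoup

sample_html = (
    '<div class="category_link"> Category: '
    '<a href="/category/personal">Personal</a> </div>'
)

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="category_link")

# The text node that sits immediately before the first <a> tag.
before_link = div.find("a").previous_sibling

# The same text node is also the first entry in the div's children.
first_child = div.contents[0]

# NavigableString subclasses str, so str() plus strip() yields a plain label.
label = str(before_link).strip()
print(label)  # prints: Category:
```

Note that div.contents[0] is the text node the question expected to find as the first child, whereas find-style calls only match tags, which would explain why the attempts in the question kept returning the link.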

Source: https://habr.com/ru/post/913204/