Removing tags from HTML using BeautifulSoup

Question

I am new to Python and am using BeautifulSoup to parse a website and extract data from it. I have the following code, shown here with its output:

for line in raw_data:  # raw_data is the parsed HTML, split into smaller blocks
    d = {}
    # Find the link inside the div with class "torrentname"
    d['name'] = line.find('div', {'class': 'torrentname'}).find('a')
    print(d['name'])

<a href="/ubuntu-9-10-desktop-i386-t3144211.html">
<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>

Normally I could extract "Ubuntu 9.10 desktop (i386)" by writing:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').string

but because of the strong tags it returns None. Is there a way to strip out the strong tags and then use .string, or is there a better way? I tried using BeautifulSoup's extract() function, but I could not get it to work.

Edit: I just realized that my solution does not work if there are two sets of strong tags, because the space between the words is lost. How can I fix this?

python html parsing beautifulsoup

Flowof soul Aug 27 '10 at 15:30

Matt Austin · Accepted Answer · 2010-08-29T03:54:02+0000

You can use .text:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').text
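
For context, .string only returns a value when a tag has exactly one child, while .text concatenates every string inside the tag. The following is a minimal sketch of that difference, assuming the newer bs4 package (beautifulsoup4) and its built-in "html.parser" backend rather than the BeautifulSoup 3 API used in the question; the html variable simply reproduces the link from the question on one line.

```python
from bs4 import BeautifulSoup

# The anchor from the question, written on a single line.
html = (
    '<a href="/ubuntu-9-10-desktop-i386-t3144211.html">'
    '<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>'
)

anchor = BeautifulSoup(html, "html.parser").find("a")

# The anchor has two children: the <strong> tag and a trailing text node.
children = list(anchor.children)

# .string is ambiguous when there is more than one child, so it is None.
single = anchor.string

# .text joins all nested strings in document order.
full_text = anchor.text

print(len(children), single, full_text)
```

Here single is None because the anchor contains both a tag and a text node, whereas full_text holds the complete link label.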

Or join the results of findAll(text=True):

anchor = line.find('div', {'class':'torrentname'}).find('a')
d['name'] = ''.join(anchor.findAll(text=True))
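
Regarding the edit about two sets of strong tags: joining with an empty string keeps only the whitespace that is already in the markup, so adjacent tags run together. One possible fix, sketched here under the assumption that the bs4 package is available, is to pass a separator to get_text() and strip each piece. The sample markup with a second strong tag is hypothetical and is not taken from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two adjacent <strong> tags with no space between them.
html = (
    '<a href="/ubuntu-9-10-desktop-i386-t3144211.html">'
    '<strong class="red">Ubuntu</strong><strong>Linux</strong>'
    ' 9.10 desktop (i386)</a>'
)

anchor = BeautifulSoup(html, "html.parser").find("a")

# Joining with '' glues the two highlighted words together.
joined = "".join(anchor.find_all(string=True))

# A separator plus strip=True puts exactly one space between the pieces.
spaced = anchor.get_text(" ", strip=True)

print(joined)
print(spaced)
```

Note that the separator is inserted between every pair of text nodes, so it would also split a word that is only partly wrapped in a tag. Whether that trade-off is acceptable depends on the markup being scraped.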
