BeautifulSoup: Strip the specified attributes, but keep the tag and its contents.

Question


I am trying to "defrontpagify" the HTML of a website created with MS FrontPage, and I am writing a BeautifulSoup script to do it.

However, I am stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that has them. The code snippet:

    REMOVE_ATTRIBUTES = [
        'lang','language','onmouseover','onmouseout','script','style','font',
        'dir','face','size','color','style','class','width','height','hspace',
        'border','valign','align','background','bgcolor','text','link','vlink',
        'alink','cellpadding','cellspacing']

    # remove all attributes in REMOVE_ATTRIBUTES from all tags,
    # but preserve the tag and its content.
    for attribute in REMOVE_ATTRIBUTES:
        for tag in soup.findAll(attribute=True):
            del(tag[attribute])

It runs without errors, but it does not actually strip any attributes. When I run it without the outer loop and hard-code a single attribute (soup.findAll(style=True)), it works.

Does anyone see a problem here?

PS - I also don't like nested loops. If someone knows a more functional map / filter-ish style, I would love to see it.

+4

python web-scraping beautifulsoup frontpage scraper

Kurtosis Jan 28 '12 at 9:03


2 answers

I am using BeautifulSoup 4 with Python 2.7, and for me tag.attrs is a dictionary rather than a list. So I had to change the code from the accepted answer to this:

    for tag in soup.recursiveChildGenerator():
        if hasattr(tag, 'attrs'):
            tag.attrs = {key:value for key,value in tag.attrs.iteritems()
                         if key not in REMOVE_ATTRIBUTES}
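For readers on Python 3, where dictionaries have no iteritems(), the same idea could be sketched roughly as follows. This is only an illustration, assuming the bs4 package and the standard-library "html.parser" backend; the helper name strip_attributes, the shortened attribute set, and the sample markup are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative set; the full list from the question works the same way.
REMOVE_ATTRIBUTES = {"align", "onmouseout", "style"}


def strip_attributes(html, unwanted=REMOVE_ATTRIBUTES):
    """Return a soup in which no tag carries an attribute named in `unwanted`."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) yields every tag and skips text nodes,
    # so no hasattr() check is needed.
    for tag in soup.find_all(True):
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in unwanted}
    return soup


# Hypothetical sample markup, loosely based on the accepted answer's document.
doc = '<p id="firstpara" align="center">This is <a onmouseout="hide()">one</a>.</p>'
cleaned = strip_attributes(doc)
print(cleaned)
```

Rebuilding tag.attrs with a single dict comprehension also avoids the nested loop the question complains about: each tag is visited once, whatever the length of the attribute list.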

+3

Nóra Oct 11 '16 at 11:16


unutbu · Accepted Answer · 2012-01-28T13:48:57+0000

The line

    for tag in soup.findAll(attribute=True):

does not find any tags. There may be a way to use the findAll method here; I'm not sure. However, this works:

    import BeautifulSoup

    REMOVE_ATTRIBUTES = [
        'lang','language','onmouseover','onmouseout','script','style','font',
        'dir','face','size','color','style','class','width','height','hspace',
        'border','valign','align','background','bgcolor','text','link','vlink',
        'alink','cellpadding','cellspacing']

    doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''

    soup = BeautifulSoup.BeautifulSoup(doc)
    for tag in soup.recursiveChildGenerator():
        try:
            tag.attrs = [(key,value) for key,value in tag.attrs
                         if key not in REMOVE_ATTRIBUTES]
        except AttributeError:
            # 'NavigableString' object has no attribute 'attrs'
            pass

    print(soup.prettify())
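As for why the loop in the question finds nothing: a keyword argument's name is taken literally, so findAll(attribute=True) searches for tags that have an attribute called "attribute" and ignores the loop variable's value. One way the original loop could probably be kept is to pass the name through the attrs dictionary instead. The sketch below assumes BeautifulSoup 4 (the bs4 package, where the method is spelled find_all) with the "html.parser" backend; the shortened attribute list and the sample markup are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative list; the full list from the question works the same way.
REMOVE_ATTRIBUTES = ["align", "onmouseout"]

# Hypothetical sample markup, loosely based on the document above.
doc = '<p id="firstpara" align="center">This is <a onmouseout="hide()">one</a>.</p>'
soup = BeautifulSoup(doc, "html.parser")

# The keyword below is the literal name "attribute", so nothing matches.
literal_matches = soup.find_all(attribute=True)
print(len(literal_matches))

for attribute in REMOVE_ATTRIBUTES:
    # attrs={name: True} matches every tag that has an attribute with that name.
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

print(soup)
```

The call find_all(**{attribute: True}) should behave the same way for most names, but the attrs dictionary also covers names such as "class" that cannot be written as plain Python keywords.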
