Problem with access to attributes in BeautifulSoup

Question


I'm having problems using BeautifulSoup with Python (2.7). The code basically consists of:

    str = '<el at="some">ABC</el><el>DEF</el>'
    z = BeautifulStoneSoup(str)

    for x in z.findAll('el'):
        # if 'at' in x:
        # if hasattr(x, 'at'):
            print x['at']
        else:
            print 'nothing'

I expected the first if statement to work correctly (that is, to print "nothing" when at does not exist), but it always prints "nothing" (that is, the test is always False). The second if, on the other hand, is always True, so the code raises a KeyError when it tries to access at on the second <el> element, which of course does not have it.


python attributes beautifulsoup

NullUserException May 01 '11 at 12:13


4 answers

If your code is as simple as what you posted, you can solve it compactly with:

    for x in z.findAll('el'):
        print x.get('at', 'nothing')


Jinx May 01 '11 at 15:47


For just scanning for an element by tag name, a pyparsing solution may be more readable (and avoids deprecated APIs such as has_key):

    from pyparsing import makeXMLTags

    # makeXMLTags creates a pyparsing expression that matches tags with
    # variations in whitespace, attributes, etc.
    el, elEnd = makeXMLTags('el')

    # scan the input text and work with elTags
    for elTag, tagstart, tagend in el.scanString(xmltext):
        if elTag.at:
            print elTag.at

As a further refinement, pyparsing lets you define a filtering parse action so that tags match only when a particular attribute value (or the attribute with any value) is present:

    # import parse action that will filter by attribute
    from pyparsing import withAttribute

    # only match el tags having the 'at' attribute, with any value
    el.setParseAction(withAttribute(at=withAttribute.ANY_VALUE))

    # now loop again, but no need to test for presence of 'at'
    # attribute - there will be no match if 'at' is not present
    for elTag, tagstart, tagend in el.scanString(xmltext):
        print elTag.at


Paulmcg May 01, '11 at 21:04


I usually use the get() method to access an attribute:

    link = soup.find('a')
    href = link.get('href')
    name = link.get('name')
    if name:
        print 'anchor'
    if href:
        print 'link'


Andreas Jung May 01, '11 at 13:07


Eli Bendersky · Accepted Answer · 2011-05-01T12:25:36+0000

The in operator is for sequence and mapping types. What makes you think the object returned by BeautifulSoup is supposed to implement it the way you expect? According to the BeautifulSoup docs, you should access attributes using the [] syntax.

Regarding hasattr, I think you are confusing HTML/XML attributes with Python object attributes. hasattr is for the latter, and BeautifulSoup, AFAIK, does not reflect the HTML/XML attributes it parsed as attributes of its own objects.

PS note that the Tag object in BeautifulSoup implements __contains__ - maybe you are trying with the wrong object? Can you show a complete but minimal example demonstrating the problem?

Running:

    from BeautifulSoup import BeautifulSoup

    str = '<el at="some">ABC</el><el>DEF</el>'
    z = BeautifulSoup(str)

    for x in z.findAll('el'):
        print type(x)
        print x['at']

I get:

    <class 'BeautifulSoup.Tag'>
    some
    <class 'BeautifulSoup.Tag'>
    Traceback (most recent call last):
      File "soup4.py", line 8, in <module>
        print x['at']
      File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 601, in __getitem__
        return self._getAttrMap()[key]
    KeyError: 'at'

This is what I expected. The first el has an at attribute, the second does not - and this raises a KeyError .
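If you want to keep the [] syntax, one option is to catch the KeyError yourself. The following is only a sketch in Python 3: plain dicts stand in for the attribute maps of the two <el> tags, and attr_or_default is a hypothetical helper name, not part of BeautifulSoup.

```python
# Plain dicts stand in for the attribute maps of the two <el> tags
# (an assumption made so the sketch runs without BeautifulSoup installed).
elements = [{"at": "some"}, {}]


def attr_or_default(attrs, name, default="nothing"):
    """Return attrs[name], or the default when the key is missing."""
    try:
        return attrs[name]
    except KeyError:
        # Same exception the traceback above ends with.
        return default


results = [attr_or_default(attrs, "at") for attrs in elements]
print(results)
```

Since a Tag raises the same KeyError from [] access, the same try/except should work around x['at'] directly.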

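Update 2: BeautifulSoup.Tag.__contains__ looks at the contents of the tag, not its attributes, so 'at' in x does not test for the attribute. To check whether an attribute exists, query the attributes themselves, for example with x.has_key('at') (has_attr in BeautifulSoup 4), or use x.get('at') and test the result.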
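To make the three lookup paths concrete, here is a minimal sketch in Python 3. FakeTag is a hypothetical stand-in written for this illustration, not the library's code; it only imitates the BeautifulSoup 3 behaviour described in this thread (in searches the contents, [] reads the attribute map, and unknown Python attributes are treated as a child-tag search that yields None). The has_attr method name is borrowed from BeautifulSoup 4.

```python
class FakeTag(object):
    """Hypothetical stand-in imitating how a BeautifulSoup 3 Tag answers lookups."""

    def __init__(self, name, attrs=None, contents=None):
        self.name = name
        self.attrs = dict(attrs or {})        # XML attributes, e.g. at="some"
        self.contents = list(contents or [])  # child nodes and strings

    def __contains__(self, item):
        # `item in tag` searches the children, not the attributes.
        return item in self.contents

    def __getitem__(self, key):
        # tag['at'] reads the attribute map; raises KeyError when absent.
        return self.attrs[key]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails. Unknown names are
        # treated as a search for a child tag, which yields None instead of
        # raising AttributeError, so hasattr() is true for any such name.
        if name.startswith('__'):
            raise AttributeError(name)
        for child in self.contents:
            if getattr(child, 'name', None) == name:
                return child
        return None

    def get(self, key, default=None):
        # Attribute lookup with a fallback value.
        return self.attrs.get(key, default)

    def has_attr(self, key):
        # Explicit test against the attribute map.
        return key in self.attrs


first = FakeTag('el', {'at': 'some'}, ['ABC'])
second = FakeTag('el', {}, ['DEF'])

for tag in (first, second):
    print('at' in tag, hasattr(tag, 'at'), tag.has_attr('at'), tag.get('at', 'nothing'))
```

Only the last two checks distinguish the two elements, which matches what the question observed: the in test is False for both and hasattr is True for both.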
