How to further filter the result of a ResultSet?

Question

I am trying to get a list of all hrefs in an HTML document. I am using Beautiful Soup to parse my HTML file.

print(soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})[0])

The result is:

<a class="m0 vl" data-tag="Homepage Library" href="/video?lang=pl&amp;format=lite&amp;v=AZpftzD9jVs" title="abc">
        text
    </a>

I'm only interested in the value of the href attribute, so I would like the result to contain only the href values.

I'm not sure how to extend this query so that it returns only the href.

python beautifulsoup

LLaP Mar 07 '14 at 20:53

2 answers

alecxe · Answer 1 · 2014-03-07T20:54:51+0000

Read the href from each tag's attrs dictionary:

links = soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})
print([link.attrs['href'] for link in links])

or get attributes directly from an element, treating it like a dictionary:

links = soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})
print([link['href'] for link in links])
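
Both forms above raise a KeyError if a matched tag has no href attribute. A minimal sketch of two ways around that, assuming the sample markup below (it is made up for illustration and is not from the question): `link.get('href')` returns None for a missing attribute, and passing `href=True` to `find_all` keeps only tags that actually have one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second anchor deliberately has no href.
page = """<body>
<a data-tag="Homepage Library" href="/video?v=1">first</a>
<a data-tag="Homepage Library">no link here</a>
<a data-tag="Other" href="/video?v=2">second</a>
</body>"""

soup = BeautifulSoup(page, "html.parser")
links = soup.body.find_all('a', attrs={'data-tag': 'Homepage Library'})

# link['href'] would raise KeyError on the second anchor;
# .get() returns None for a missing attribute instead.
hrefs_or_none = [link.get('href') for link in links]

# href=True restricts the match to tags that have an href attribute.
with_href = soup.body.find_all(
    'a', attrs={'data-tag': 'Homepage Library'}, href=True
)
hrefs = [link['href'] for link in with_href]

print(hrefs_or_none)
print(hrefs)
```

Which one fits depends on whether you want to know about anchors without a link or simply skip them.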

DEMO:

from bs4 import BeautifulSoup


page = """<body>
<a href="link1">text1</a>
<a href="link2">text2</a>
<a href="link3">text3</a>
<a href="link4">text4</a>
</body>"""

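# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page, 'html.parser')
links = soup.body.find_all('a')
print([link.attrs['href'] for link in links])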

prints

['link1', 'link2', 'link3', 'link4']

Hope this helps.

LLaP · Answer 2 · 2014-03-07T21:14:37+0000

Finally, this worked for me:

soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})[0].attrs["href"]
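
Note that `find_all` returns a list-like ResultSet, so it has to be indexed (or looped over) before an attribute can be read. A single Tag has `attrs`, the ResultSet does not. When only the first match is needed, `find` returns that Tag directly, or None if nothing matches. A rough sketch, assuming the simplified markup below (hypothetical, shortened from the question's example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, simplified from the example in the question.
page = """<body>
<a class="m0 vl" data-tag="Homepage Library" href="/video?v=AZpftzD9jVs" title="abc">text</a>
</body>"""

soup = BeautifulSoup(page, "html.parser")
query = {'data-tag': 'Homepage Library'}

# find() returns the first matching Tag, or None when there is no match.
first = soup.body.find('a', attrs=query)
href = first['href'] if first is not None else None

# Equivalent with find_all(): index the ResultSet first, then read the attribute.
href_via_index = soup.body.find_all('a', attrs=query)[0]['href']

# No match: find() gives None rather than raising.
missing = soup.body.find('a', attrs={'data-tag': 'Does Not Exist'})

print(href)
```

The `find` form avoids an IndexError on an empty result, at the cost of an explicit None check.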
