Tagging an html page using python 2.7 with beautifulsoup

Question

I am trying to parse an HTML page with the following format:

    <img class="outer" id="first" />
    <div class="content" .../>
    <div class="content" .../>
    <div class="content" />
    <img class="outer" id="second" />
    <div class="content" .../>
    <div class="content" .../>
    <img class="outer" id="third" />
    <div class="content" .../>
    <div class="content" .../>

When iterating over the div tags, I want to find out whether the current div tag falls under the img tag with the id "first", "second" or "third". Is there any way to do this? I have a list of img blocks and a list of div blocks:

    img_blocks = soup.find_all('img', attrs={'class':'outer'})
    div_Blocks = soup.find_all('div', attrs={'class':'content'})

+4

python python-2.7 beautifulsoup

Ranjan Jun 30 '13 at 7:12


2 answers

Not from your current starting point. You need to iterate over all tags, or at least over tags of both types: if the tag is an img, save its id; if it is a div, the most recently saved id tells you which container you are in. NB: you can pass a compiled re pattern to BeautifulSoup to select only those two tag types.

You are currently losing that context by retrieving the two kinds of tags as separate lists.
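The single-pass idea described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the sample markup, the helper name `pair_divs_with_img_ids`, and the use of a list of tag names (instead of a compiled `re` pattern) are all assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the layout in the question.
# Explicit closing tags are used so the result does not depend on how a
# particular parser treats self-closing <div/> elements.
html = (
    '<img class="outer" id="first"/>'
    '<div class="content">a</div>'
    '<div class="content">b</div>'
    '<img class="outer" id="second"/>'
    '<div class="content">c</div>'
)

soup = BeautifulSoup(html, "html.parser")


def pair_divs_with_img_ids(soup):
    """Walk img and div tags in document order, remembering the last img id."""
    current_id = None
    pairs = []
    # Passing a list of names returns both tag types in document order.
    for tag in soup.find_all(["img", "div"]):
        classes = tag.get("class", [])
        if tag.name == "img" and "outer" in classes:
            current_id = tag.get("id")  # entering a new img "section"
        elif tag.name == "div" and "content" in classes:
            pairs.append((current_id, tag.get_text()))
    return pairs


pairs = pair_divs_with_img_ids(soup)
```

Here `pairs` holds one `(img id, div text)` tuple per content div, so each div carries the id of the img it follows.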

0

Steve barnes Jun 30 '13 at 7:25


Terrya · Accepted Answer · 2013-06-30T07:19:23+0000

Use .find_previous_sibling on each div tag to get the nearest preceding img sibling:

    >>> for divtag in div_Blocks:
    ...     print divtag.find_previous_sibling('img')
    ...
    <img class="outer" id="first"/>
    <img class="outer" id="first"/>
    <img class="outer" id="first"/>
    <img class="outer" id="second"/>
    <img class="outer" id="second"/>
    <img class="outer" id="third"/>
    <img class="outer" id="third"/>
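To get the id itself rather than the whole img tag, the same call can be followed by an attribute lookup. The sketch below assumes hypothetical sample markup and a variable name `owners` that are not part of the original answer; it also guards against a div that has no preceding img, in which case `find_previous_sibling` returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the first div deliberately precedes any img.
html = (
    '<div class="content">orphan</div>'
    '<img class="outer" id="first"/>'
    '<div class="content">a</div>'
    '<div class="content">b</div>'
    '<img class="outer" id="second"/>'
    '<div class="content">c</div>'
)

soup = BeautifulSoup(html, "html.parser")
div_blocks = soup.find_all("div", attrs={"class": "content"})

owners = []
for div in div_blocks:
    # Nearest earlier sibling that is an <img class="outer">, or None.
    img = div.find_previous_sibling("img", attrs={"class": "outer"})
    owners.append(img["id"] if img is not None else None)
```

Note that this relies on the img and div tags being siblings, as in the question's markup; if the divs were nested inside other elements, `find_previous` may be the more appropriate call.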
