I am doing a small project in which I track mentions of political leaders in newspapers. Sometimes a politician is mentioned without a link in any parent or child element (due, I guess, to semantically poor markup).
So I want to write a function that finds the closest link and extracts it. In the sample below the search string is "Rasmussen", and the link I want is /307046.
    #-*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    import re

    tekst = '''
    <li>
      <div class="views-field-field-webrubrik-value">
        <h3>
          <a href="/307046">Claus Hjort spiller med mærkede kort</a>
        </h3>
      </div>
      <div class="views-field-field-skribent-uid">
        <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
      </div>
      <div class="views-field-field-webteaser-value">
        <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
          trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
          snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
          genkomst som statsminister
        </div>
      </div>
      <span class="views-field-view-node">
        <span class="actions">
          <a href="/307046">Læs mere</a> |
          <a href="/307046/#comments">Kommentarer (4)</a>
        </span>
      </span>
    </li>
    '''

    to_find = "Rasmussen"
    soup = BeautifulSoup(tekst)
    contexts = soup.find_all(text=re.compile(to_find))

    def find_nearest(element, url, direction="both"):
        """Find the nearest link, relative to a text string.
        When complete it will search up and down (parent, child),
        and only X levels up/down. These features are not implemented yet.
        Will then return the link the fewest steps away from the
        original element. Assumes we have already found an element."""

        # Is the nearest link readily available?
        # If so - this works and extracts the link.
        if element.find_parents('a'):
            for artikel_link in element.find_parents('a'):
                link = artikel_link.get('href')
                # sometimes the link is a relative link - sometimes it is not
                if "http" not in link and "www" not in link:
                    link = url + link
                return link

        # But if the link is not readily available, we will go up.
        # This is (I think) where it goes wrong
        # -------------------------------------
        if not element.find_parents('a'):
            element = element.parent
            # Print for debugging
            print element  # on the 2nd run (ie <li>) this finds <a href=/307046>
                           # So shouldn't it be caught as readily available above?
            print u"Found: %s" % element.name
            # the recursive call
            find_nearest(element, url)

    # run it
    if contexts:
        for a in contexts:
            find_nearest(element=a, url="http://information.dk")
This direct call works:

    print contexts[0].parent.parent.parent.a['href'].encode('utf-8')
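For what it's worth, here is a sketch of how that chained `.parent.parent.parent.a` call might be generalised: climb one level at a time and take the first `<a href=...>` found anywhere inside that level. The function name `nearest_link_upwards` is mine, not from the project, and this is only one possible reading of "nearest":

```python
from bs4 import BeautifulSoup

def nearest_link_upwards(text_node):
    """Climb the tree one level at a time; at each level, return the
    href of the first <a href=...> found anywhere in that subtree."""
    for ancestor in text_node.parents:
        a = ancestor.find('a', href=True)
        if a is not None:
            return a['href']
    return None
```

With the snippet from the question, `nearest_link_upwards(contexts[0])` would climb past the two teaser `<div>`s and find the headline link under the shared `<li>`.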
For reference, the full source code is on Bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne
(PS: I'm using BeautifulSoup 4.)
EDIT: SimonSapin asked me to define "closest": by closest I mean the link that is the fewest nesting levels away from the search string, in any direction. In the text above, the href (generated by the Drupal-based newspaper site) is neither a direct parent nor a child of the tag containing the search string, so BeautifulSoup can't find it.
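To illustrate the point with a minimal, self-contained repro (not from the project): `find_parents('a')` only walks ancestors, so it comes back empty when the link lives in a sibling subtree.

```python
from bs4 import BeautifulSoup

html = '<li><h3><a href="/307046">x</a></h3><div>Rasmussen</div></li>'
soup = BeautifulSoup(html, 'html.parser')
node = soup.find('div').string

# the <a> is neither an ancestor nor a descendant of the text node:
assert node.find_parents('a') == []
```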
I suspect "fewest nesting levels" will work in most cases. This particular case could be hacked together with find and rfind on the raw string - but I would really like to do it through BeautifulSoup. Since contexts[0].parent.parent.parent.a['href'].encode('utf-8') works, it should be possible to generalise it into a script.
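One way to make "fewest nesting levels in any direction" precise inside BeautifulSoup, sketched under my own assumptions (the helper name `closest_link` is hypothetical): score every `<a href>` by steps up to the lowest common ancestor plus steps down from it, and keep the minimum.

```python
from bs4 import BeautifulSoup

def closest_link(soup, text_node):
    """Return the href of the <a href=...> with the smallest nesting
    distance: steps from the text node up to the lowest common
    ancestor, plus steps from that ancestor down to the link."""
    up = list(text_node.parents)                # ancestors, nearest first
    best_href, best_dist = None, None
    for a in soup.find_all('a', href=True):
        down_ids = [id(p) for p in a.parents]   # identity, not tag equality
        for i, ancestor in enumerate(up):
            if id(ancestor) in down_ids:
                dist = (i + 1) + (down_ids.index(id(ancestor)) + 1)
                break
        else:
            continue                            # no common ancestor
        if best_dist is None or dist < best_dist:
            best_href, best_dist = a['href'], dist
    return best_href
```

Note the comparison uses `id()` rather than `==`, because bs4 tags compare equal by content, which would mis-match structurally identical but distinct tags.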
EDIT: Perhaps I should emphasise that I am looking for a BeautifulSoup solution. Combining BS with a custom/simple breadth-first search, as @erik85 suggested, would quickly become messy, I think.