Find the closest link with BeautifulSoup (Python)

I am doing a small project in which I track mentions of political leaders in newspapers. Sometimes a politician is mentioned, and there is no parent or child element with a link (due, I guess, to semantically bad markup).

So I want to create a function that can find the closest link and extract it. In the case below, the search string is Rasmussen, and the link I want is /307046.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    import re

    tekst = '''
    <li>
      <div class="views-field-field-webrubrik-value">
        <h3>
          <a href="/307046">Claus Hjort spiller med mærkede kort</a>
        </h3>
      </div>
      <div class="views-field-field-skribent-uid">
        <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
      </div>
      <div class="views-field-field-webteaser-value">
        <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
          trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
          snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
          genkomst som statsministe
        </div>
      </div>
      <span class="views-field-view-node">
        <span class="actions">
          <a href="/307046">Læs mere</a> |
          <a href="/307046/#comments">Kommentarer (4)</a>
        </span>
      </span>
    </li>
    '''

    to_find = "Rasmussen"
    soup = BeautifulSoup(tekst)
    contexts = soup.find_all(text=re.compile(to_find))

    def find_nearest(element, url, direction="both"):
        """Find the nearest link, relative to a text string.

        When complete it will search both up and down (parent, child),
        and only X levels up/down. These features are not implemented yet.
        Will then return the link the fewest steps away from the original
        element. Assumes we have already found an element."""

        # Is the nearest link readily available?
        # If so - this works and extracts the link.
        if element.find_parents('a'):
            for artikel_link in element.find_parents('a'):
                link = artikel_link.get('href')
                # sometimes the link is a relative link - sometimes it is not
                if ("http" or "www") not in link:
                    link = url + link
                return link

        # But if the link is not readily available, we will go up.
        # This is (I think) where it goes wrong:
        # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
        if not element.find_parents('a'):
            element = element.parent

            # Print for debugging
            print element  # on the 2nd run (i.e. <li>) this finds <a href="/307046">,
                           # so shouldn't it be caught as readily available above?
            print u"Found: %s" % element.name

            # the recursive call
            find_nearest(element, url)

    # run it
    if contexts:
        for a in contexts:
            find_nearest(element=a, url="http://information.dk")

The direct call below works:

 print contexts[0].parent.parent.parent.a['href'].encode('utf-8') 

For reference, the entire code is on Bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne

(PS: Using BeautifulSoup 4)


EDIT: SimonSapin asks what I mean by closest: by closest I mean the link that is the fewest nesting levels away from the search string, in any direction. In the text above, the href (generated by the newspaper's Drupal-based site) is neither a direct parent nor a child of the tag where the search string is found, so BeautifulSoup can't find it that way.

I suspect that "fewest characters away" would also work in most cases. In that case the source could be hacked together with find and rfind, but I would really like to do it through BS. Since this works: contexts[0].parent.parent.parent.a['href'].encode('utf-8'), it should be possible to generalize it into a script.

EDIT: Perhaps I should emphasize that I am looking for a BeautifulSoup solution. Combining BS with a custom/simple breadth-first search, as @erik85 suggests, would quickly become messy, I think.
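For what it's worth, here is a minimal sketch of the kind of generalisation I have in mind, using only BS (the helper name is made up, and it only walks upward, so it merely approximates "fewest nesting levels away" by searching each ancestor's subtree in document order):

    # Rough sketch, not a finished solution: walk up from the text node and
    # return the first <a href> found anywhere in an ancestor's subtree.
    def first_link_upwards(element, url="http://information.dk"):
        node = element.parent
        while node is not None:
            a = node.find('a', href=True)  # first link below this ancestor
            if a is not None:
                link = a['href']
                # handle relative links, as in my original function
                if not link.startswith('http'):
                    link = url + link
                return link
            node = node.parent
        return None

    for context in contexts:
        print(first_link_upwards(context))  # -> http://information.dk/307046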

2 answers

Here is a solution using lxml. The basic idea is to collect all the preceding and following elements and then iterate through them round-robin:

    def find_nearest(elt):
        preceding = elt.xpath('preceding::*/@href')[::-1]
        following = elt.xpath('following::*/@href')
        parent = elt.xpath('parent::*/@href')
        for href in roundrobin(parent, preceding, following):
            return href

A similar solution using BeautifulSoup's (or bs4's) next_elements and previous_elements should also be possible; a sketch of that variant follows after the full listing below.


    import lxml.html as LH
    import itertools

    def find_nearest(elt):
        preceding = elt.xpath('preceding::*/@href')[::-1]
        following = elt.xpath('following::*/@href')
        parent = elt.xpath('parent::*/@href')
        for href in roundrobin(parent, preceding, following):
            return href

    def roundrobin(*iterables):
        "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
        # http://docs.python.org/library/itertools.html#recipes
        # Author: George Sakkis
        pending = len(iterables)
        nexts = itertools.cycle(iter(it).next for it in iterables)
        while pending:
            try:
                for n in nexts:
                    yield n()
            except StopIteration:
                pending -= 1
                nexts = itertools.cycle(itertools.islice(nexts, pending))

    tekst = '''
    <li>
      <div class="views-field-field-webrubrik-value">
        <h3>
          <a href="/307046">Claus Hjort spiller med mærkede kort</a>
        </h3>
      </div>
      <div class="views-field-field-skribent-uid">
        <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
      </div>
      <div class="views-field-field-webteaser-value">
        <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
          trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
          snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
          genkomst som statsministe
        </div>
      </div>
      <span class="views-field-view-node">
        <span class="actions">
          <a href="/307046">Læs mere</a> |
          <a href="/307046/#comments">Kommentarer (4)</a>
        </span>
      </span>
    </li>
    '''

    to_find = "Rasmussen"
    doc = LH.fromstring(tekst)

    for x in doc.xpath('//*[contains(text(),{s!r})]'.format(s=to_find)):
        print(find_nearest(x))

gives

 /307046 
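For reference, a minimal sketch of the bs4 variant mentioned above (my own adaptation, not part of the original answer; it assumes the same tekst and to_find, and replaces the Python 2 round-robin recipe with one that also runs on Python 3):

    import re
    from bs4 import BeautifulSoup

    def roundrobin_py3(*iterables):
        # Same idea as the recipe above, without .next, so it runs on Python 3.
        iterators = [iter(it) for it in iterables]
        while iterators:
            for it in list(iterators):
                try:
                    yield next(it)
                except StopIteration:
                    iterators.remove(it)

    def find_nearest_bs4(element):
        # previous_elements / next_elements yield nodes nearest-first, so
        # alternating between them finds a nearby <a href> in either direction.
        preceding = (el for el in element.previous_elements
                     if getattr(el, 'name', None) == 'a' and el.get('href'))
        following = (el for el in element.next_elements
                     if getattr(el, 'name', None) == 'a' and el.get('href'))
        for link in roundrobin_py3(preceding, following):
            return link['href']

    soup = BeautifulSoup(tekst, 'html.parser')
    for text_node in soup.find_all(text=re.compile(to_find)):
        print(find_nearest_bs4(text_node))  # -> /307046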

Someone will probably come up with a copy-and-paste solution, and you will think that it solves your problem. But your problem is not the code; it is your strategy. There is a software design principle called divide and conquer that you should apply when redesigning your code: separate the code that interprets your HTML strings as trees/graphs from the code that searches for the nearest node (probably a breadth-first search). Not only will you learn to design better software, your problem will probably simply cease to exist.

I think you are smart enough to solve this yourself, but I also want to provide a skeleton:

    def parse_html(txt):
        """Reads a string of HTML and returns a dict/list/tuple representation."""
        pass

    def breadth_first_search(graph, start, end):
        """Finds the shortest way from start to end.

        You can probably customize start and end to work well with the input
        you want to provide. For implementation details see the link in the
        text above.
        """
        pass

    def find_nearest_link(html, name):
        """Putting it all together."""
        return breadth_first_search(parse_html(html), name, "link")
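As a hedged illustration only (the names and design choices here are mine, not a definitive implementation): if you let BeautifulSoup's parse tree itself be the graph, with each node's parent and children as its neighbours, the breadth-first search could look like this:

    import re
    from collections import deque
    from bs4 import BeautifulSoup, Tag

    def bfs_nearest_link(start):
        # Classic BFS: the first <a href> dequeued is reachable in the
        # fewest parent/child steps from the starting text node.
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if id(node) in seen:  # id(), since Tags compare by content
                continue
            seen.add(id(node))
            if isinstance(node, Tag) and node.name == 'a' and node.get('href'):
                return node['href']
            if node.parent is not None:
                queue.append(node.parent)
            if isinstance(node, Tag):
                queue.extend(node.children)
        return None

    soup = BeautifulSoup(tekst, 'html.parser')  # tekst as in the question
    start = soup.find(text=re.compile("Rasmussen"))
    print(bfs_nearest_link(start))  # -> /307046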

PS: Another principle applies here too, this one from mathematics: when you face a problem you don't know a solution to (finding links close to a chosen substring) and there is a class of problems you do know a solution to (graph traversal), try to transform your problem so that it fits that class. Then you can simply reuse the standard solution patterns (which may even already be implemented in the language or framework of your choice), and you're done.


Source: https://habr.com/ru/post/921937/

