Python lxml / Beautiful Soup to find all links on a web page

I am writing a script to read a web page and build a database of links that match certain criteria. Right now I am stuck on lxml and on understanding how to grab all the <a href> values from the HTML:

    result = self._openurl(self.mainurl)
    content = result.read()
    html = lxml.html.fromstring(content)
    print lxml.html.find_rel_links(html, 'href')
3 answers

Use XPath. Something like (can't check here):

 urls = html.xpath('//a/@href') 
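For reference, here is a minimal, self-contained sketch of that approach (the HTML snippet is made up for illustration; in practice you would parse the content you downloaded):

    import lxml.html

    # a small, made-up HTML snippet standing in for a downloaded page
    html_text = """
    <html><body>
      <a href="/about">About</a>
      <a href="https://example.com/contact">Contact</a>
      <a name="no-href">anchor without an href</a>
    </body></html>
    """

    doc = lxml.html.fromstring(html_text)

    # //a/@href selects the href attribute value of every <a> element;
    # the <a> without an href is simply skipped
    urls = doc.xpath('//a/@href')
    print(urls)  # ['/about', 'https://example.com/contact']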

With iterlinks() , lxml provides an excellent function for this task.

This yields (element, attribute, link, pos) for every link [...] in action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc.
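A short sketch of what that looks like (the snippet below is made up; if you only care about <a href>, filter by tag and attribute):

    import lxml.html

    html_text = '<body><a href="/about">About</a><img src="/logo.png"></body>'
    doc = lxml.html.fromstring(html_text)

    # iterlinks() yields (element, attribute, link, pos) for every link-like
    # attribute, so the <img src> shows up alongside the <a href>
    for element, attribute, link, pos in doc.iterlinks():
        print(element.tag, attribute, link)

    # keep only the <a href> links
    hrefs = [link for el, attr, link, pos in doc.iterlinks()
             if el.tag == 'a' and attr == 'href']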


I want to provide an alternative solution based on lxml.

It uses the CSSSelector class provided by lxml.cssselect:

    import urllib
    import lxml.html
    from lxml.cssselect import CSSSelector

    connection = urllib.urlopen('http://www.yourTargetURL/')
    dom = lxml.html.fromstring(connection.read())
    selAnchor = CSSSelector('a')
    foundElements = selAnchor(dom)
    print [e.get('href') for e in foundElements]
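Two small caveats, in case they help: in recent lxml releases lxml.cssselect depends on the separate cssselect package (pip install cssselect), and the hrefs you get back are whatever appeared in the markup, often relative. If you want absolute URLs for your database, make_links_absolute() can normalise them; a rough sketch with a made-up base URL:

    import lxml.html

    # made-up markup and base URL, just to show the normalisation
    doc = lxml.html.fromstring('<a href="/about">About</a> <a href="news.html">News</a>')
    doc.make_links_absolute('https://example.com/section/')

    print([a.get('href') for a in doc.cssselect('a')])
    # ['https://example.com/about', 'https://example.com/section/news.html']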
