Python lxml / Beautiful Soup to find all links on a web page

I am writing a script to read a web page and build a database of links that match certain criteria. Right now I am stuck on lxml and on understanding how to grab all the <a href> values from the HTML:

    result = self._openurl(self.mainurl)
    content = result.read()
    html = lxml.html.fromstring(content)
    print lxml.html.find_rel_links(html, 'href')
3 answers

Use XPath. Something like (can't check here):

 urls = html.xpath('//a/@href') 
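For reference, here is a minimal, self-contained sketch of that approach (the HTML snippet is made up for illustration; in practice you would parse the content you downloaded):

    import lxml.html

    # a small, made-up HTML snippet standing in for a downloaded page
    html_text = """
    <html><body>
      <a href="/about">About</a>
      <a href="https://example.com/contact">Contact</a>
      <a name="no-href">anchor without an href</a>
    </body></html>
    """

    doc = lxml.html.fromstring(html_text)

    # //a/@href selects the href attribute value of every <a> element;
    # the <a> without an href is simply skipped
    urls = doc.xpath('//a/@href')
    print(urls)  # ['/about', 'https://example.com/contact']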

With iterlinks() , lxml provides an excellent function for this task.

This yields (element, attribute, link, pos) for every link [...] in action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc.
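A short sketch of what that looks like (the snippet below is made up; if you only care about <a href>, filter by tag and attribute):

    import lxml.html

    html_text = '<body><a href="/about">About</a><img src="/logo.png"></body>'
    doc = lxml.html.fromstring(html_text)

    # iterlinks() yields (element, attribute, link, pos) for every link-like
    # attribute, so the <img src> shows up alongside the <a href>
    for element, attribute, link, pos in doc.iterlinks():
        print(element.tag, attribute, link)

    # keep only the <a href> links
    hrefs = [link for el, attr, link, pos in doc.iterlinks()
             if el.tag == 'a' and attr == 'href']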


I want to provide an alternative solution based on lxml.

It uses the CSSSelector class provided by lxml.cssselect:

    import urllib
    import lxml.html
    from lxml.cssselect import CSSSelector

    connection = urllib.urlopen('http://www.yourTargetURL/')
    dom = lxml.html.fromstring(connection.read())
    selAnchor = CSSSelector('a')
    foundElements = selAnchor(dom)
    print [e.get('href') for e in foundElements]
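Two small caveats, in case they help: in recent lxml releases lxml.cssselect depends on the separate cssselect package (pip install cssselect), and the hrefs you get back are whatever appeared in the markup, often relative. If you want absolute URLs for your database, make_links_absolute() can normalise them; a rough sketch with a made-up base URL:

    import lxml.html

    # made-up markup and base URL, just to show the normalisation
    doc = lxml.html.fromstring('<a href="/about">About</a> <a href="news.html">News</a>')
    doc.make_links_absolute('https://example.com/section/')

    print([a.get('href') for a in doc.cssselect('a')])
    # ['https://example.com/about', 'https://example.com/section/news.html']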
