How to replace links using lxml and iterlinks

I am new to lxml and I'm trying to figure out how to rewrite links using iterlinks().

    import lxml.html

    html = lxml.html.document_fromstring(doc)
    for element, attribute, link, pos in html.iterlinks():
        if attribute == "src":
            link = link.replace('foo', 'bar')
    print(lxml.html.tostring(html))

However, this does not actually replace links. I know I can use .rewrite_links, but iterlinks provides more information about each link, so I would prefer to use this.

Thanks in advance.

3 answers

Assigning a new string to the name link only rebinds that loop variable; it does not touch the document. You need to change the element itself, in this case by setting its src attribute:

    new_src = link.replace('foo', 'bar')  # or element.get('src').replace('foo', 'bar')
    element.set('src', new_src)
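For context, here is a minimal, self-contained sketch of the whole loop with that change applied. The sample markup in `doc` is made up for illustration, and wrapping `iterlinks()` in `list()` is just a precaution so the tree is not modified while it is being walked.

```python
import lxml.html

# Hypothetical sample document, used only for illustration.
doc = '<html><body><img src="/foo/a.png"><a href="/foo/page">link</a></body></html>'

html = lxml.html.document_fromstring(doc)

# Collect the links first, then modify the elements.
for element, attribute, link, pos in list(html.iterlinks()):
    if attribute == "src":
        # Write the new value back to the element; rebinding `link`
        # alone would leave the tree unchanged.
        element.set("src", link.replace("foo", "bar"))

result = lxml.html.tostring(html, encoding="unicode")
print(result)
```

Only the src attribute is rewritten here; the href on the a tag is left as it was.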

Please note: if you know which "links" you are interested in, for example, only img tags, you can also get elements using .findall() (or xpath or css selectors) instead of using .iterlinks().
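As a rough sketch of that alternative, the snippet below selects only img elements with an XPath expression and rewrites their src attribute. The sample markup is hypothetical, and the `//img[@src]` expression is just one way to pick the elements.

```python
import lxml.html

# Hypothetical sample document, used only for illustration.
doc = ('<html><body><img src="/foo/a.png">'
       '<script src="/foo/app.js"></script></body></html>')

html = lxml.html.document_fromstring(doc)

# Select only <img> elements that actually carry a src attribute.
for img in html.xpath("//img[@src]"):
    img.set("src", img.get("src").replace("foo", "bar"))

result = lxml.html.tostring(html, encoding="unicode")
print(result)
```

The script tag also has a src attribute, but it is not matched by the expression and so stays untouched.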


lxml provides a rewrite_links method (also available as a function that takes the HTML text and parses it into a document first) for changing all the links in a document:

.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None): This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL. For each link, link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:bob@example.com" (or javascript:...).
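A small sketch of how this could look in practice follows. The sample markup and the `replace_foo` helper are made up for illustration. Note that the callback receives only the link string, not the element or attribute, which is the limitation the question mentions.

```python
import lxml.html

# Hypothetical sample document, used only for illustration.
doc = ('<html><body><a href="#top">top</a>'
       '<img src="/foo/a.png">'
       '<a href="mailto:bob@example.com">mail</a></body></html>')

html = lxml.html.document_fromstring(doc)

def replace_foo(link):
    # Every link is passed in, so return anchors and non-URL schemes unchanged.
    if link.startswith(("#", "mailto:", "javascript:")):
        return link
    return link.replace("foo", "bar")

html.rewrite_links(replace_foo)

result = lxml.html.tostring(html, encoding="unicode")
print(result)
```

Returning the link unchanged keeps it as is; returning None instead would remove the attribute or tag, as described above.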


The link is probably just a copy of the actual value. Try replacing the attribute on the element in your loop. The element might be just a copy too, but it is worth a try.


Source: https://habr.com/ru/post/886693/

