How to remove a tag from a node in lxml without removing its tail?

Example:

html = <a><b>Text</b>Text2</a>

BeautifulSoup code:

[x.extract() for x in html.findAll('b')]

Result:

html = <a>Text2</a>

lxml code:

[bad.getparent().remove(bad) for bad in html.xpath(".//b")]

Result:

html = <a></a>

because lxml treats "Text2" as the tail of <b></b>, so removing the element removes the tail with it.
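
To make the text/tail model concrete, here is a minimal sketch using lxml.etree (the question uses an HTML tree, so the parser choice here is an assumption made for brevity). It shows that "Text2" is stored on the <b> element itself, which is why it disappears together with it:

```python
from lxml import etree

root = etree.fromstring("<a><b>Text</b>Text2</a>")
b = root.find("b")

print(b.text)     # text inside <b>...</b>
print(b.tail)     # text that follows </b>; it is stored on the <b> element
print(root.text)  # None, because nothing precedes <b> inside <a>

# Removing <b> takes its tail along, so "Text2" is lost from the tree.
root.remove(b)
print(etree.tostring(root, encoding="unicode"))
```

Any text that follows a closing tag belongs to that tag's element as .tail, not to the parent.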

If we only need the joined text content, we can blank out the tag's text instead:

for bad in raw.xpath(xpath_search):
    bad.text = ''

But how can we remove the tags themselves, without changing the remaining text and without losing the tail?

2 answers

I did the following to save the tail text by moving it to the previous sibling or, if there is none, to the parent.

def remove_keeping_tail(self, element):
    """Save the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)

def _preserve_tail_before_delete(self, node):
    if node.tail: # preserve the tail
        previous = node.getprevious()
        if previous is not None: # if there is a previous sibling it will get the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else: # The parent get the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
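
The two methods above take self, so they are meant to live in a class. As a rough standalone sketch of the same idea (the plain-function form, the ValueError check, and the sample markup are assumptions added here, not part of the answer):

```python
from lxml import etree


def remove_keeping_tail(element):
    """Remove element from its parent, moving its tail text to a neighbour."""
    parent = element.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # A previous sibling exists: append the tail to that sibling's tail.
            previous.tail = (previous.tail or "") + element.tail
        else:
            # No previous sibling: append the tail to the parent's text.
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)


root = etree.fromstring("<a>Head<b>Text</b>Text2<c/>Text3</a>")
for bad in root.xpath(".//b"):
    remove_keeping_tail(bad)

result = etree.tostring(root, encoding="unicode")
print(result)  # <a>HeadText2<c/>Text3</a>
```

The tail has to be copied before calling remove(), because lxml keeps the tail attached to the removed element.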


While the accepted answer from phlou will work, there are simpler ways to remove tags without removing their tails.

If you want to remove a specific element, then the lxml method you are looking for is drop_tree (available on lxml.html elements).

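From the docs:

Drops the element and all its children. Unlike el.getparent().remove(el), this does not remove the tail text; with drop_tree the tail text is merged with the previous element.

If you want to remove all instances of a specific tag, you can use lxml.etree.strip_elements (or lxml.html.etree.strip_elements) with with_tail=False.

According to the docs, it deletes all elements with the given tag names from a tree or subtree, including their attributes, text content, and descendants. It also removes the tail text of each element unless you explicitly set the with_tail keyword argument to False.

So, for the example in the original post: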

>>> from lxml.html import fragment_fromstring, tostring
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> for bad in html.xpath('.//b'):
...    bad.drop_tree()
>>> tostring(html, encoding='unicode')
'<a>Text2</a>'

or, using strip_elements:

>>> from lxml.html import fragment_fromstring, tostring, etree
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> etree.strip_elements(html, 'b', with_tail=False)
>>> tostring(html, encoding='unicode')
'<a>Text2</a>'

Source: https://habr.com/ru/post/1672847/

