Get all text inside a tag in lxml

I would like to write a piece of code that captures all the text inside the <content> tag in lxml, in all three instances below, including the nested tags themselves. I tried tostring(getchildren()), but that skips the text between the tags. I have not had much luck finding the relevant function in the API. Could you help me?

    <!--1-->
    <content>
    <div>Text inside tag</div>
    </content>
    #should return "<div>Text inside tag</div>"

    <!--2-->
    <content>
    Text with no tag
    </content>
    #should return "Text with no tag"

    <!--3-->
    <content>
    Text outside tag <div>Text inside tag</div>
    </content>
    #should return "Text outside tag <div>Text inside tag</div>"
+63
python parsing lxml
14 answers

Try:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        parts = ([node.text] +
                 list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
                 [node.tail])
        # filter removes possible Nones in texts and tails
        return ''.join(filter(None, parts))

Example:

    from lxml import etree
    node = etree.fromstring("""<content>
    Text outside tag <div>Text <em>inside</em> tag</div>
    </content>""")
    stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

+41

Does text_content() do what you need?
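
As a rough illustration of what this suggestion returns: text_content() is provided by lxml.html elements and returns only the text, without any markup. The sample markup below is made up for the sketch and is not taken from the question.

```python
import lxml.html

# Hypothetical sample markup, chosen only for this sketch.
node = lxml.html.fragment_fromstring(
    "<div>Text outside tag <em>inside</em> tail</div>"
)

# text_content() concatenates every text node under the element
# and drops the tags themselves.
text = node.text_content()
print(text)
```

Because the inner tags are dropped, this fits case 2 of the question but not cases 1 and 3, where the markup should be preserved.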

+66
Aug 15 '12 at 3:14

Just use the node.itertext() method, as in:

  ''.join(node.itertext()) 
+55
Feb 25 '13 at 19:00

The following snippet, which uses Python generators, works well and is efficient.

''.join(node.itertext()).strip()
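
A small sketch of what itertext() yields may help here. It walks the element and its descendants and yields each text and tail string in document order, so the joined result contains no tags. The input string is an assumption modeled on case 3 of the question.

```python
from lxml import etree

# Sample input modeled on case 3 of the question (an assumption of this sketch).
node = etree.fromstring(
    "<content>Text outside tag <div>Text <em>inside</em> tag</div></content>"
)

# itertext() yields the text and tail strings in document order.
pieces = list(node.itertext())
joined = "".join(pieces)
print(pieces)
print(joined)
```

The markup is lost in the joined string, so this approach suits text-only extraction rather than the tag-preserving output asked for in the question.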

+17
Jun 27 '16 at 11:08

A version of Albert's stringify_children that fixes the bugs reported by hoju:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        return ''.join(
            chunk for chunk in chain(
                (node.text,),
                chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
                (node.tail,)) if chunk)
+16
Jan 27 '15 at 15:23
    import urllib2  # Python 2; on Python 3 use urllib.request instead
    from lxml import etree

    url = 'some_url'

Get the page:

    response = urllib2.urlopen(url)
    page = response.read()

Get all the HTML inside the table tag:

    tree = etree.HTML(page)

Select the table with XPath:

    # xpath() returns a list of matches, so take the first one
    table = tree.xpath("xpath_here")[0]
    res = etree.tostring(table)

res is the HTML code of the table. This worked for me.

So you can extract the text content of tags using xpath_text() and the tags together with their contents using tostring():

    div = tree.xpath("//div")[0]
    div_res = etree.tostring(div)
    text = tree.xpath_text("//content")

or

    text = tree.xpath("//content/text()")

    div_3 = tree.xpath("//content")[0]
    div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')

The last line, which uses the strip method, is not nice, but it just works.

+4
Aug 19 '12 at 1:14

Defining stringify_children this way can be less complicated:

    from lxml import etree

    def stringify_children(node):
        s = node.text
        if s is None:
            s = ''
        for child in node:
            s += etree.tostring(child, encoding='unicode')
        return s

or in one line:

    return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

The rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail of node itself is not interesting in this case, since it lies "behind" the end tag. Note that the encoding argument can be changed as needed.

Another possible solution is to serialize the node itself and then discard the start and end tags:

    def stringify_children(node):
        s = etree.tostring(node, encoding='unicode', with_tail=False)
        return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat awful. This code is only valid if node has no attributes, and I don't think anyone will want to use it even then.

+3
Jun 10 '14 at 22:26

In response to @Richard's comment above, if you fix stringify_children to read:

       parts = ([node.text] +
    --          list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
    ++          list(chain(*([tostring(c)] for c in node.getchildren()))) +
                [node.tail])

it seems to avoid the duplication he refers to.

+2
Apr 30 '13 at 16:18

One of the simplest code snippets, which actually worked for me and follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is:

 etree.tostring(html, method="text") 

where html is the node/tag whose full text you are trying to read. Note, however, that it does not get rid of the contents of script and style tags.
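
A minimal sketch of this call, using made-up inputs: method="text" serializes only the text nodes, and encoding="unicode" is added here (an assumption beyond the snippet above) so that the result is a str rather than bytes.

```python
from lxml import etree

# Sample input modeled on case 3 of the question.
node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)
text_only = etree.tostring(node, method="text", encoding="unicode")
print(repr(text_only))

# The text of script elements is kept as well (hypothetical markup).
page = etree.fromstring("<body><script>var x = 1;</script><p>Hello</p></body>")
page_text = etree.tostring(page, method="text", encoding="unicode")
print(repr(page_text))
```

Like text_content() and itertext(), this drops the tags, so it answers the text-only variant of the question.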

+2
Jul 05 '17 at 6:53

I know this is an old question, but this is a common problem, and I have a solution that seems simpler than the ones proposed so far:

    import lxml.html

    def stringify_children(node):
        """Given an lxml tag, return its contents as a string.

        >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
        >>> node = lxml.html.fragment_fromstring(html)
        >>> stringify_children(node)
        "<strong>Sample sentence</strong> with tags."
        """
        if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
            return ""
        node.attrib.clear()  # note: this removes the attributes from the node itself
        opening_tag = len(node.tag) + 2     # length of "<tag>"
        closing_tag = -(len(node.tag) + 3)  # length of "</tag>"
        return lxml.html.tostring(node)[opening_tag:closing_tag]

Unlike some other answers to this question, this solution preserves all the tags contained in the node, and it attacks the problem from a different angle than the other working solutions.

+1
Sep 08 '15 at 10:22

Here is a working solution. We can get the content with the parent tag, and then cut the parent tag from the output.

    import re
    from lxml import etree

    def _tostr_with_tags(parent_element, html_entities=False):
        RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
        content_with_parent = etree.tostring(parent_element)

        def _replace_html_entities(s):
            RE_ENTITY = r'&#(\d+);'

            def repl(m):
                return unichr(int(m.group(1)))

            replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)
            return replaced

        if not html_entities:
            content_with_parent = _replace_html_entities(content_with_parent)

        content_with_parent = content_with_parent.strip()  # remove 'white' characters on margins

        start_tag, content_without_parent, end_tag = re.findall(
            RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

        if start_tag != end_tag:
            raise Exception('Start tag does not match to end tag while getting content with tags.')

        return content_without_parent

parent_element must be of type Element.

Note that if you want the text content (and not HTML entities in the text), leave the html_entities parameter set to False.

0
Aug 18 '17 at 17:12

lxml has a method for this:

 node.text_content() 
0
Oct 08 '17 at 8:36

If this is a tag, you can try:

 node.values() 
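
For reference, a quick check of what values() returns on an lxml element. The attribute names and values below are hypothetical and exist only for this sketch.

```python
from lxml import etree

# Hypothetical element with two attributes.
node = etree.fromstring('<content id="c1" class="main">Text</content>')

# values() lists the attribute values of the element, not its text.
attribute_values = node.values()
print(attribute_values)
print(node.text)
```

In this sketch the element's text is reached through node.text, while values() only reflects the attributes.
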
-2
Nov 14 '12 at 16:30
    import re
    from lxml import etree

    node = etree.fromstring("""
    <content>Text before inner tag
    <div>Text <em>inside</em> tag </div>
    Text after inner tag
    </content>""")

    # raw string so that \A and \Z reach the regex engine unchanged
    print re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node), re.DOTALL).group(1)
-3
Jan 08 '15 at 0:59