Get all text inside tag in lxml

Question

I would like to write a code snippet that grabs all of the text inside the <content> tag in lxml, in all three instances below, including the inner tags. I tried tostring(getchildren()), but that misses the text between the tags. I did not have much luck finding a relevant function in the API. Could you help me?

    <!--1-->
    <content>
    <div>Text inside tag</div>
    </content>
    #should return "<div>Text inside tag</div>"

    <!--2-->
    <content>
    Text with no tag
    </content>
    #should return "Text with no tag"

    <!--3-->
    <content>
    Text outside tag <div>Text inside tag</div>
    </content>
    #should return "Text outside tag <div>Text inside tag</div>"

+63

python parsing lxml

Kevin Burke Jan 07 2018-11-11T00:

14 answers

Does text_content() do what you need?
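As a rough illustration, assuming the element was parsed with lxml.html (text_content() is defined on lxml.html elements, not on plain lxml.etree ones), the method returns the text with all markup removed. That matches case 2 in the question, but not the cases that should keep the inner tags. The sample markup is made up for this sketch.

```python
import lxml.html

# Hypothetical sample markup; a <div> stands in for the question's <content>.
node = lxml.html.fragment_fromstring(
    "<div>Text outside tag <b>Text inside tag</b></div>"
)

# text_content() concatenates all descendant text and drops the tags.
text = node.text_content()
print(text)
```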

+66

Ed Summers Aug 15 '12 at 3:14

Just use the node.itertext() method, as in:

  ''.join(node.itertext())
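A small sketch of what itertext() yields, using made-up sample markup: it walks the element in document order and returns each text and tail string, so joining the pieces gives the text without any tags.

```python
from lxml import etree

# Hypothetical sample document for illustration.
node = etree.fromstring(
    "<content>Text outside tag <div>Text <em>inside</em> tag</div></content>"
)

# itertext() yields every text and tail string in document order.
parts = list(node.itertext())
joined = "".join(parts)
print(parts)
print(joined)
```

Note that the tags themselves are lost, so this fits when only the text is wanted.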

+55

Arthur Debert Feb 25 '13 at 19:00

The following snippet, which uses Python generators, works well and is efficient:

    ''.join(node.itertext()).strip()

+17

Sandeep Jun 27 '16 at 11:08

A version of albertov's stringify_children that fixes the bugs reported by hoju:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        return ''.join(
            chunk for chunk in chain(
                (node.text,),
                chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
                (node.tail,)) if chunk)

+16

anana Jan 27 '15 at 15:23

    import urllib2  # Python 2; on Python 3 the equivalent is urllib.request
    from lxml import etree

    url = 'some_url'

Get the URL:

    test = urllib2.urlopen(url)
    page = test.read()

Get all the HTML code inside the table tag:

    tree = etree.HTML(page)

XPath selector:

    table = tree.xpath("xpath_here")[0]  # xpath() returns a list; take the first match
    res = etree.tostring(table)

res is the HTML code of the table. This worked for me.

So you can retrieve the text content of tags using xpath_text(), and the tags together with their contents using tostring():

    div = tree.xpath("//div")[0]  # xpath() returns a list
    div_res = etree.tostring(div)

    text = tree.xpath_text("//content")

or

    text = tree.xpath("//content/text()")

    div_3 = tree.xpath("//content")[0]
    div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')

The last line, which uses strip, is not pretty (strip removes a set of characters rather than a fixed prefix, so it can also eat characters of the content itself), but it worked for me.
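A minimal sketch of the XPath route, under the assumption that the document looks like the made-up sample below. Two details are easy to miss: xpath() returns a list for node selections, and text() selects only the direct text children of an element. The xpath_text call is left as posted in the answer; it may not be available on lxml elements, so this sketch uses only xpath().

```python
from lxml import etree

# Hypothetical sample document for illustration.
tree = etree.fromstring(
    "<root><content>Text outside tag <div>Text inside tag</div> more</content></root>"
)

# xpath() returns a list of matching elements, so index into it.
content = tree.xpath("//content")[0]

# text() matches only the direct text children of <content>,
# so the text inside <div> is not part of the result.
direct_text = tree.xpath("//content/text()")

# string() flattens the first match to its full text, without tags.
all_text = tree.xpath("string(//content)")

print(direct_text)
print(all_text)
```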

+4

d3day Aug 19 '12 at 1:14

Defining stringify_children this way can be less complicated:

    from lxml import etree

    def stringify_children(node):
        s = node.text
        if s is None:
            s = ''
        for child in node:
            s += etree.tostring(child, encoding='unicode')
        return s

or in one line:

    return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

The rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail of the node itself is not relevant here, since it comes after the node's end tag. Note that the encoding argument can be changed as needed.

Another possible solution is to serialize the node itself and then discard the start and end tags:

    def stringify_children(node):
        s = etree.tostring(node, encoding='unicode', with_tail=False)
        return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat awful. This code is only valid if node has no attributes, and I don’t think anyone will want to use it even then.

+3

Percival Ulysses Jun 10 '14 at 22:26

In response to @Richard's comment above, if you patch stringify_children to read:

    parts = ([node.text] +
    --       list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
    ++       list(chain(*([tostring(c)] for c in node.getchildren()))) +
             [node.tail])

it seems to avoid the duplication he refers to.

+2

bwingenroth Apr 30 '13 at 16:18

One of the simplest snippets that actually worked for me, and that follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is:

    etree.tostring(html, method="text")

where html is the node/tag whose full text you are trying to read. Be aware that it does not get rid of the contents of script and style tags.
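A minimal sketch of the method="text" serialization, with made-up sample markup. Passing encoding="unicode" is an addition here so that the result is a str rather than bytes.

```python
from lxml import etree

# Hypothetical sample document for illustration.
node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# method="text" serializes only the text; encoding="unicode" returns str
# instead of bytes.
text = etree.tostring(node, method="text", encoding="unicode")
print(text)
```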

+2

Deepan Prabhu Babu Jul 05 '17 at 6:53

I know this is an old question, but this is a general problem, and I have a solution that seems simpler than the ones proposed so far:

    import lxml.html

    def stringify_children(node):
        """Given an lxml tag, return its contents as a string

        >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
        >>> node = lxml.html.fragment_fromstring(html)
        >>> stringify_children(node)
        "<strong>Sample sentence</strong> with tags."
        """
        if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
            return ""
        node.attrib.clear()  # note: this modifies the node that was passed in
        opening_tag = len(node.tag) + 2
        closing_tag = -(len(node.tag) + 3)
        return lxml.html.tostring(node)[opening_tag:closing_tag]

Unlike some other answers to this question, this solution preserves all the tags contained in the node, and it attacks the problem from a different angle than the other working solutions.

+1

Joshmaker Sep 08 '15 at 10:22

Here is a working solution. We can get the content with the parent tag, and then cut the parent tag from the output.

    import re
    from lxml import etree

    def _tostr_with_tags(parent_element, html_entities=False):
        RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
        content_with_parent = etree.tostring(parent_element)

        def _replace_html_entities(s):
            RE_ENTITY = r'&#(\d+);'

            def repl(m):
                return unichr(int(m.group(1)))

            replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)
            return replaced

        if not html_entities:
            content_with_parent = _replace_html_entities(content_with_parent)

        content_with_parent = content_with_parent.strip()  # remove 'white' characters on margins

        start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

        if start_tag != end_tag:
            raise Exception('Start tag does not match to end tag while getting content with tags.')

        return content_without_parent

parent_element must be of type Element .

Note that if you want text content (and not HTML entities in the text), leave the html_entities parameter set to False.

0

sergzach Aug 18 '17 at 17:12

lxml has a method for this:

 node.text_content()

0

Hrabal Oct 08 '17 at 8:36

If this is a tag, you can try:

 node.values()
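For reference, a quick check with made-up sample markup: on an lxml element, values() returns the attribute values, while the text is on .text, so this call is not a substitute for the approaches above.

```python
from lxml import etree

# Hypothetical sample element with two attributes.
node = etree.fromstring('<content id="a" lang="en">Text with no tag</content>')

# values() lists the attribute values, not the element's text.
attribute_values = node.values()
text = node.text

print(attribute_values)
print(text)
```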

-2

David Nov 14 '12 at 16:30

    import re
    from lxml import etree

    node = etree.fromstring("""
    <content>Text before inner tag
    <div>Text <em>inside</em> tag </div>
    Text after inner tag
    </content>""")

    # encoding="unicode" makes tostring() return text, so the str pattern matches
    print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding="unicode"), re.DOTALL).group(1))

-3

kazufusa Jan 08 '15 at 0:59

albertov · Accepted Answer · 2011-01-07 09:35

Try:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        parts = ([node.text] +
                 list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
                 [node.tail])
        # filter removes possible Nones in texts and tails
        return ''.join(filter(None, parts))

Example:

    from lxml import etree

    node = etree.fromstring("""<content>
    Text outside tag <div>Text <em>inside</em> tag</div>
    </content>""")
    stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'
