Iterate over multiple (parent, child) nodes using Python ElementTree

The standard implementation of ElementTree for Python (2.6) does not provide pointers from child nodes to their parents. Therefore, if parents are needed, it is suggested to loop over the parents rather than the children.

Consider that my XML has the form:

 <Content>
   <Para>first</Para>
   <Table><Para>second</Para></Table>
   <Para>third</Para>
 </Content>

The following finds all of the Para nodes, without their parents:

 (1) paras = [p for p in page.getiterator("Para")] 

This (adapted from effbot) keeps the parent by iterating over the parents instead of the child nodes:

 (2) paras = [(c,p) for p in page.getiterator() for c in p] 

This makes sense, and it can be extended with a condition to achieve (supposedly) the same result as (1), but with added information about the parents:

 (3) paras = [(c,p) for p in page.getiterator() for c in p if c.tag == "Para"] 

The ElementTree documentation implies that the getiterator() method does a depth-first search. Running it without looking for the parent, as in (1), yields:

 first
 second
 third

However, extracting text from paragraphs in (3) gives:

 first, Content>Para
 third, Content>Para
 second, Table>Para

This looks like a breadth-first traversal.

Therefore, two questions arise.

  • Is this the correct and expected behavior?
  • How do you retrieve (parent, child) tuples when the child must be of a certain type but the parent can be anything, if document order must be preserved? I do not think that running two loops and matching the (parent, child) pairs generated by (3) against the order generated by (1) is ideal.
1 answer

Consider this:

 >>> xml = """<Content>
 ... <Para>first</Para>
 ... <Table><Para>second</Para></Table>
 ... <Para>third</Para>
 ... </Content>"""
 >>> import xml.etree.cElementTree as et
 >>> page = et.fromstring(xml)
 >>> for p in page.getiterator():
 ...     print "ppp", p.tag, repr(p.text)
 ...     for c in p:
 ...         print "ccc", c.tag, repr(c.text), p.tag
 ...
 ppp Content '\n '
 ccc Para 'first' Content
 ccc Table None Content
 ccc Para 'third' Content
 ppp Para 'first'
 ppp Table None
 ccc Para 'second' Table
 ppp Para 'second'
 ppp Para 'third'
 >>>

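As an aside, list comprehensions are great until you want to see exactly what is being iterated over :-)

getiterator yields the ppp elements in document order. However, you are picking your elements from the subsidiary ccc elements, which are not in the desired order: every child of an element is emitted when that element is visited, so all of Content's children come out before any of Table's.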
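The difference between the two orderings can be reproduced in a few lines. This is a minimal sketch that assumes a current Python 3, where `iter()` takes the place of `getiterator()` (which was removed in Python 3.9); the names `doc_order` and `pair_order` are only illustrative.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# Comprehension (1): iter("Para") walks the tree depth-first, in document order.
doc_order = [p.text for p in page.iter("Para")]

# Comprehension (3): the outer loop is in document order, but each pair is
# emitted when its parent is visited, so Content's children precede Table's.
pair_order = [(c.text, p.tag) for p in page.iter() for c in p if c.tag == "Para"]

print(doc_order)   # ['first', 'second', 'third']
print(pair_order)  # [('first', 'Content'), ('third', 'Content'), ('second', 'Table')]
```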

One solution is to do your own iteration:

 >>> def process(elem, parent):
 ...     print elem.tag, repr(elem.text), parent.tag if parent is not None else None
 ...     for child in elem:
 ...         process(child, elem)
 ...
 >>> process(page, None)
 Content '\n ' None
 Para 'first' Content
 Table None Content
 Para 'second' Table
 Para 'third' Content
 >>>

Now you can snarf "Para" elements, each with a reference to its parent (if any), as they stream by.
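For instance, collecting the "Para" elements could look roughly like the following. It is a sketch written for Python 3 rather than the 2.6 session above, and the `collect` helper and its `found` list are hypothetical names, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

XML = """<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>"""


def collect(elem, parent, tag, found):
    """Recursively append (element, parent) pairs whose tag matches `tag`."""
    if elem.tag == tag:
        found.append((elem, parent))
    # Recurse into the children in document order, passing elem as their parent.
    for child in elem:
        collect(child, elem, tag, found)


page = ET.fromstring(XML)
paras = []
collect(page, None, "Para", paras)

# Each entry keeps a reference to the parent element itself.
summary = [(e.text, p.tag) for e, p in paras]
print(summary)  # [('first', 'Content'), ('second', 'Table'), ('third', 'Content')]
```

Because the match is recorded before the recursion into the children, the pairs come out in document order.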

This can be wrapped up nicely in a generator:

 >>> def iterate_with_parent(elem):
 ...     stack = []
 ...     while 1:
 ...         for child in reversed(elem):
 ...             stack.append((child, elem))
 ...         if not stack: return
 ...         elem, parent = stack.pop()
 ...         yield elem, parent
 ...
 >>>
 >>> showtag = lambda e: e.tag if e is not None else None
 >>> showtext = lambda e: repr((e.text or '').rstrip())
 >>> for e, p in iterate_with_parent(page):
 ...     print e.tag, showtext(e), showtag(p)
 ...
 Para 'first' Content
 Table '' Content
 Para 'second' Table
 Para 'third' Content
 >>>
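The same stack-based idea can be sketched for Python 3 with an optional tag filter, which gives the (child, parent) tuples the question asks for in document order. The `iter_with_parent` name and its `tag` parameter are assumptions made for this illustration. Like the generator above, it does not yield the root element itself.

```python
import xml.etree.ElementTree as ET

XML = """<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>"""


def iter_with_parent(root, tag=None):
    """Yield (element, parent) pairs below root in document order.

    If tag is given, only elements with that tag are yielded.
    """
    # Children are pushed in reverse so the first child is popped first.
    stack = [(child, root) for child in reversed(list(root))]
    while stack:
        elem, parent = stack.pop()
        if tag is None or elem.tag == tag:
            yield elem, parent
        # Descend into elem before moving on to its later siblings.
        stack.extend((child, elem) for child in reversed(list(elem)))


page = ET.fromstring(XML)
pairs = [(e.text, p.tag) for e, p in iter_with_parent(page, "Para")]
print(pairs)  # [('first', 'Content'), ('second', 'Table'), ('third', 'Content')]
```

An explicit stack avoids Python's recursion limit on very deep documents, at the cost of being slightly harder to read than the recursive version.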

Source: https://habr.com/ru/post/1335931/

