The standard ElementTree implementation for Python (2.6) does not provide pointers from child nodes to their parents. Therefore, if parents are needed, the suggestion is to loop over the parents rather than the children.
Suppose my XML has the form:
<Content> <Para>first</Para> <Table><Para>second</Para></Table> <Para>third</Para> </Content>
The following finds all of the Para nodes, without their parents:
(1) paras = [p for p in page.getiterator("Para")]
This (adapted from effbot) also stores the parent, by iterating over the parents instead of the child nodes:
(2) paras = [(c,p) for p in page.getiterator() for c in p]
This makes sense, and it can be extended with a condition to achieve (presumably) the same result as (1), but with additional information about the parents:
(3) paras = [(c,p) for p in page.getiterator() for c in p if c.tag == "Para"]
The ElementTree documentation implies that the getiterator() method does a depth-first traversal. Running (1), without the parent lookup, gives:
first
second
third
However, extracting the text (and the parent>child tags) from the pairs in (3) gives:
first, Content>Para
third, Content>Para
second, Table>Para
So the traversal appears to be breadth-first.
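For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes a newer Python in which `iter()` is available in place of `getiterator()` (on 2.6, substitute `getiterator`); the names `order_1` and `order_3` are only illustrative and the XML is the sample from above with whitespace removed.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (assumption: whitespace stripped).
xml = ("<Content><Para>first</Para>"
       "<Table><Para>second</Para></Table>"
       "<Para>third</Para></Content>")
page = ET.fromstring(xml)

# (1) children only
paras = [p for p in page.iter("Para")]
order_1 = [p.text for p in paras]

# (3) (child, parent) pairs, filtered on the child's tag
pairs = [(c, p) for p in page.iter() for c in p if c.tag == "Para"]
order_3 = [(c.text, p.tag) for c, p in pairs]

print(order_1)
print(order_3)
```

With this sample, the two lists come out in different orders, which is the discrepancy described above.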
Therefore, two questions arise.
- Is this the correct and expected behavior?
- How can I retrieve (parent, child) tuples where the child must be of a certain type but the parent can be anything, while preserving document order? I do not think that running two loops and matching the (parent, child) pairs generated by (3) against the order generated by (1) is ideal.
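One direction that might avoid the matching step, sketched here as an assumption rather than a confirmed answer: build a child-to-parent dictionary once, then walk the tree for the wanted tag and look up each match. `iter_with_parent` and `parent_of` are hypothetical names, and `iter()` is assumed in place of `getiterator()`.

```python
import xml.etree.ElementTree as ET

def iter_with_parent(root, tag):
    """Yield (child, parent) pairs for elements matching tag,
    in the order root.iter(tag) visits them."""
    # One pass to record each element's parent (elements hash by identity).
    parent_of = {c: p for p in root.iter() for c in p}
    for elem in root.iter(tag):
        # The root has no parent, so it is skipped even if its tag matches.
        if elem in parent_of:
            yield elem, parent_of[elem]

page = ET.fromstring(
    "<Content><Para>first</Para>"
    "<Table><Para>second</Para></Table>"
    "<Para>third</Para></Content>"
)
result = [(c.text, p.tag) for c, p in iter_with_parent(page, "Para")]
print(result)
```

This still traverses the tree twice, but there is no matching of one result list against another; whether that counts as ideal is open.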