Should memory usage increase when using ElementTree.iterparse() while clear()ing elements?

    import os
    import xml.etree.ElementTree as et

    for ev, el in et.iterparse(os.sys.stdin):
        el.clear()

Running the above on the ODP structure RDF dump leads to ever-increasing memory use. Why is this? As I understand it, ElementTree is still building a parse tree, albeit with clear()ed child nodes. If that is the reason for this memory usage pattern, is there a way around it?
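A minimal sketch (not part of the original question) of one way to observe the growth: tracemalloc tracks Python-level allocations, so sampling it periodically shows whether memory keeps climbing even though clear() is being called. The sampling interval and the stdin source are assumptions for illustration.

    import sys
    import tracemalloc
    import xml.etree.ElementTree as et

    tracemalloc.start()
    for i, (ev, el) in enumerate(et.iterparse(sys.stdin)):
        el.clear()
        if i % 100000 == 0:
            current, peak = tracemalloc.get_traced_memory()
            print(f"elements={i} current={current // 1024} KiB peak={peak // 1024} KiB",
                  file=sys.stderr)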

+6
3 answers

You are clear()ing each element, but references to them remain in the root document, so the individual elements still cannot be garbage collected. See this discussion in the ElementTree documentation.

The solution is to clear the references held by the root as well, for example:

    from xml.etree.ElementTree import iterparse

    # get an iterable
    context = iterparse(source, events=("start", "end"))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = next(context)

    for event, elem in context:
        if event == "end" and elem.tag == "record":
            # ... process record elements ...
            root.clear()

Another thing to keep in mind about memory usage, which may not affect your situation, is that once the Python runtime allocates heap memory from the system, it generally never gives that memory back. Most Java virtual machines work the same way. So you should not expect the size of the interpreter in top or ps to ever decrease, even when that heap memory is no longer in use.
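If you want to watch that figure from inside the script rather than in top, one option (a Linux-only helper of my own, not part of the answer) is to read the resident set size from /proc/self/status and print it every few thousand elements inside the parsing loop:

    def rss_kib():
        """Current resident set size in KiB, from /proc/self/status (Linux only)."""
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return None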

+8

As mentioned in Kevin Guerra's answer, the root.clear() strategy from the ElementTree documentation only removes the root's fully parsed children. If those children anchor huge branches, that is not much help.

He touched on the ideal solution but didn't include any code, so here is an example:

    import xml.etree.ElementTree as ET

    element_stack = []
    context = ET.iterparse(stream, events=('start', 'end'))
    for event, elem in context:
        if event == 'start':
            element_stack.append(elem)
        elif event == 'end':
            element_stack.pop()
            # see if elem is one of interest and do something with it here
            if element_stack:
                element_stack[-1].remove(elem)
    del context

The element of interest will not have any subelements; they will have been removed as soon as their end tags were seen. That may be fine if all you need is the element's text or attributes.
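As a concrete illustration of that limitation (a self-contained sketch over a made-up document, not from the answer), running the same stack approach on a small in-memory XML shows that each item keeps its attributes and leading text, but its children are already gone by the time its end event fires:

    import io
    import xml.etree.ElementTree as ET

    xml_data = """<catalog>
      <item id="1">first<detail>ignored</detail></item>
      <item id="2">second<detail>ignored</detail></item>
    </catalog>"""

    element_stack = []
    for event, elem in ET.iterparse(io.StringIO(xml_data), events=('start', 'end')):
        if event == 'start':
            element_stack.append(elem)
        elif event == 'end':
            element_stack.pop()
            if elem.tag == 'item':
                # Attributes and leading text survive; the <detail> children do not.
                print(elem.get('id'), repr(elem.text), 'children:', len(elem))
            if element_stack:
                element_stack[-1].remove(elem)

This prints 1 'first' children: 0 and 2 'second' children: 0; the <detail> elements were removed at their own end events.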

If you want to query an element's descendants, you need to keep its full branch. To do that, maintain a flag, implemented as a depth counter for those elements, and only call .remove() when the depth is zero:

    element_stack = []
    interesting_element_depth = 0
    context = ET.iterparse(stream, events=('start', 'end'))
    for event, elem in context:
        if event == 'start':
            element_stack.append(elem)
            if elem.tag == 'foo':
                interesting_element_depth += 1
        elif event == 'end':
            element_stack.pop()
            if elem.tag == 'foo':
                interesting_element_depth -= 1
                # do something with elem and its descendants here
            if element_stack and not interesting_element_depth:
                element_stack[-1].remove(elem)
    del context
+1

I ran into the same problem, and the documentation does not make things very clear. In my case the problems were:

1) The clear() call only frees the memory used by an element's children. The documentation reads as if it frees all of the memory, but clear() does not free the element it is called on, because that element is still referenced by the parent that created it (see the small check below).

2) Whether a call to root.clear() helps depends on what root actually is. If root is the parent of the parsed elements, it works; otherwise it will not free the memory.
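A minimal check of the first point (my own sketch, not part of the original answer): after clear(), the child's own subelements are gone, but the cleared child itself is still attached to, and kept alive by, its parent.

    import xml.etree.ElementTree as ET

    root = ET.fromstring("<root><child><grandchild/></child></root>")
    child = root[0]
    child.clear()

    print(len(child))  # 0 -> the grandchild has been dropped
    print(len(root))   # 1 -> the cleared child is still referenced by root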

The fix was to keep a reference to the parent and, when the node is no longer needed, call parent.remove(child_node). This worked and kept the memory profile at a few KB.
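The answer does not include code; a minimal sketch of what this could look like (my assumption, using a stack of open elements because ElementTree elements do not expose their parent) ends up mirroring the stack approach from the previous answer:

    import sys
    import xml.etree.ElementTree as ET

    parents = []  # stack of currently open elements; parents[-1] is elem's parent
    for event, elem in ET.iterparse(sys.stdin, events=('start', 'end')):
        if event == 'start':
            parents.append(elem)
        else:  # 'end'
            parents.pop()
            # ... use elem here while it is still complete ...
            if parents:
                parents[-1].remove(elem)  # drop the parent's reference so elem can be freed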

0

Source: https://habr.com/ru/post/912753/

