Should memory usage increase when using ElementTree.iterparse() while clear()ing elements?

    import os
    import xml.etree.ElementTree as et

    for ev, el in et.iterparse(os.sys.stdin):
        el.clear()

Running the above on the ODP structure RDF dump leads to ever-increasing memory use. Why is this? As I understand it, ElementTree is still building a parse tree, albeit with clear()ed child nodes. If that is the reason for this memory usage pattern, is there a way around it?
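A minimal sketch (not part of the original question) of one way to observe the growth: tracemalloc tracks Python-level allocations, so sampling it periodically shows whether memory keeps climbing even though clear() is being called. The sampling interval and the stdin source are assumptions for illustration.

    import sys
    import tracemalloc
    import xml.etree.ElementTree as et

    tracemalloc.start()
    for i, (ev, el) in enumerate(et.iterparse(sys.stdin)):
        el.clear()
        if i % 100000 == 0:
            current, peak = tracemalloc.get_traced_memory()
            print(f"elements={i} current={current // 1024} KiB peak={peak // 1024} KiB",
                  file=sys.stderr)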

+6
3 answers

You are clear()ing each element, but references to them remain in the root document, so the individual elements still cannot be garbage collected. See this discussion in the ElementTree documentation.

The solution is to clear the references held by the root as well, for example:

    from xml.etree.ElementTree import iterparse

    # get an iterable
    context = iterparse(source, events=("start", "end"))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = next(context)

    for event, elem in context:
        if event == "end" and elem.tag == "record":
            # ... process record elements ...
            root.clear()

Another thing to keep in mind about memory usage, which may not affect your situation, is that once the Python runtime allocates heap memory from the system, it generally never gives that memory back. Most Java virtual machines work the same way. So you should not expect the size of the interpreter in top or ps to ever decrease, even when that heap memory is no longer in use.
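If you want to watch that figure from inside the script rather than in top, one option (a Linux-only helper of my own, not part of the answer) is to read the resident set size from /proc/self/status and print it every few thousand elements inside the parsing loop:

    def rss_kib():
        """Current resident set size in KiB, from /proc/self/status (Linux only)."""
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return None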

+8

As mentioned in Kevin Guerra's answer, the root.clear() strategy from the ElementTree documentation only removes the root's fully parsed children. If those children anchor huge branches, that is not much help.

He touched on the ideal solution but didn't include any code, so here is an example:

    import xml.etree.ElementTree as ET

    element_stack = []
    context = ET.iterparse(stream, events=('start', 'end'))
    for event, elem in context:
        if event == 'start':
            element_stack.append(elem)
        elif event == 'end':
            element_stack.pop()
            # see if elem is one of interest and do something with it here
            if element_stack:
                element_stack[-1].remove(elem)
    del context

The element of interest will not have any subelements; they will have been removed as soon as their end tags were seen. That may be fine if all you need is the element's text or attributes.
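As a concrete illustration of that limitation (a self-contained sketch over a made-up document, not from the answer), running the same stack approach on a small in-memory XML shows that each item keeps its attributes and leading text, but its children are already gone by the time its end event fires:

    import io
    import xml.etree.ElementTree as ET

    xml_data = """<catalog>
      <item id="1">first<detail>ignored</detail></item>
      <item id="2">second<detail>ignored</detail></item>
    </catalog>"""

    element_stack = []
    for event, elem in ET.iterparse(io.StringIO(xml_data), events=('start', 'end')):
        if event == 'start':
            element_stack.append(elem)
        elif event == 'end':
            element_stack.pop()
            if elem.tag == 'item':
                # Attributes and leading text survive; the <detail> children do not.
                print(elem.get('id'), repr(elem.text), 'children:', len(elem))
            if element_stack:
                element_stack[-1].remove(elem)

This prints 1 'first' children: 0 and 2 'second' children: 0; the <detail> elements were removed at their own end events.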

If you want to query an element's descendants, you need to keep its full branch. To do that, maintain a flag, implemented as a depth counter for those elements, and only call .remove() when the depth is zero:

    element_stack = []
    interesting_element_depth = 0
    context = ET.iterparse(stream, events=('start', 'end'))
    for event, elem in context:
        if event == 'start':
            element_stack.append(elem)
            if elem.tag == 'foo':
                interesting_element_depth += 1
        elif event == 'end':
            element_stack.pop()
            if elem.tag == 'foo':
                interesting_element_depth -= 1
                # do something with elem and its descendants here
            if element_stack and not interesting_element_depth:
                element_stack[-1].remove(elem)
    del context
+1

I ran into the same problem, and the documentation does not make things very clear. In my case the problems were:

1) The clear() call only frees the memory used by an element's children. The documentation reads as if it frees all of the memory, but clear() does not free the element it is called on, because that element is still referenced by the parent that created it (see the small check below).

2) Whether a call to root.clear() helps depends on what root actually is. If root is the parent of the parsed elements, it works; otherwise it will not free the memory.
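A minimal check of the first point (my own sketch, not part of the original answer): after clear(), the child's own subelements are gone, but the cleared child itself is still attached to, and kept alive by, its parent.

    import xml.etree.ElementTree as ET

    root = ET.fromstring("<root><child><grandchild/></child></root>")
    child = root[0]
    child.clear()

    print(len(child))  # 0 -> the grandchild has been dropped
    print(len(root))   # 1 -> the cleared child is still referenced by root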

The fix was to keep a reference to the parent and, when the node is no longer needed, call parent.remove(child_node). This worked and kept the memory profile at a few KB.
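The answer does not include code; a minimal sketch of what this could look like (my assumption, using a stack of open elements because ElementTree elements do not expose their parent) ends up mirroring the stack approach from the previous answer:

    import sys
    import xml.etree.ElementTree as ET

    parents = []  # stack of currently open elements; parents[-1] is elem's parent
    for event, elem in ET.iterparse(sys.stdin, events=('start', 'end')):
        if event == 'start':
            parents.append(elem)
        else:  # 'end'
            parents.pop()
            # ... use elem here while it is still complete ...
            if parents:
                parents[-1].remove(elem)  # drop the parent's reference so elem can be freed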

0

Source: https://habr.com/ru/post/912753/

