DOMDocument / XPath memory leak during a long command-line process - any way to destroy this class?

I have a PHP scraping command-line application that uses XPath to parse HTML, and a problem occurs every time a new instance of the DOMXPath class is created in a loop: I lose an amount of memory roughly equal to the size of the XML being loaded. The script starts and runs, memory usage slowly increasing, until it hits the memory limit and the script terminates.

I have tried forcing garbage collection with gc_collect_cycles(), and PHP still does not release the memory from old XPath queries. In fact, it appears the definition of the DOMXPath class does not even include a destructor function?

So my question is... is there any way to forcibly free the memory used by DOMXPath once I have extracted the data I need? Calling unset() on the instance predictably does nothing.

The code is nothing special, just the standard XPath stuff:

    //Loaded outside of loop
    $this->dom = new DOMDocument();

    //Inside Loop
    $this->dom->loadHTML($output);
    $xpath = new DOMXPath($this->dom);
    $nodes = $xpath->query("//span[@class='ckass']");

    //unset($this->dom) and unset($xpath) doesn't seem to have any effect

As you can see above, I keep the instance of the DOMDocument class outside the loop, though that does not seem to improve things. I even tried moving the $xpath instantiation out of the loop and loading the DOM into XPath directly via its constructor; the memory loss is the same.
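For reference, a minimal sketch of how the growth can be observed per iteration ($pages here is a hypothetical array of scraped HTML strings standing in for the real input):

    $dom = new DOMDocument();
    foreach ($pages as $output) {
        @$dom->loadHTML($output);          // @ suppresses warnings on messy HTML
        $xpath = new DOMXPath($dom);
        $nodes = $xpath->query("//span[@class='ckass']");
        unset($xpath, $nodes);
        gc_collect_cycles();
        echo memory_get_usage(true), "\n"; // climbs each pass despite unset + GC
    }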

+4
2 answers

Seeing this answer sit here for years without a resolution, finally an update! I am now facing a similar problem myself, and it turns out that DOMXPath simply leaks memory and you cannot control it. I have not searched whether this has been reported on bugs.php.net yet (that could be useful to edit in later).

The "working" solutions I found for this problem are workarounds. The basic idea is to replace the DOMNodeList Traversable returned by DOMXPath::query() with a different one containing the same nodes.
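To illustrate that idea (not the actual library code), here is a hedged sketch that materializes the query result into a plain array so nothing holds on to the DOMNodeList itself; the helper name queryAsIterator is made up for this example:

    function queryAsIterator(DOMXPath $xpath, $expression)
    {
        // Copy the matched nodes out of the DOMNodeList into a plain array,
        // then return an ArrayIterator over those same nodes.
        $nodes = iterator_to_array($xpath->query($expression), false);
        return new ArrayIterator($nodes);
    }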

The most suitable approach is DOMXPathElementsIterator, which lets you query for the specific xpath expression you have in your question without leaking memory:

    $nodes = new DOMXPathElementsIterator($this->dom, "//span[@class='ckass']");
    foreach ($nodes as $span) {
        ...
    }

This class is now part of the Iterator-Garden development version and $nodes is an iterator over all <span> DOMElements.

The downside of this workaround is that the xpath result is limited to what SimpleXMLElement::xpath() can return (which differs from DOMXPath::query()), because it is used internally to prevent the memory leak.
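As a hedged illustration of that internal mechanism (the function name is hypothetical, and this mirrors the description above rather than the library's actual code), the same expression can be routed through SimpleXML and each match mapped back to a DOMElement:

    function xpathViaSimpleXml(DOMDocument $dom, $expression)
    {
        // Run the expression through SimpleXML instead of DOMXPath, then
        // map each match back into the DOM tree via dom_import_simplexml().
        $matches  = simplexml_import_dom($dom)->xpath($expression) ?: array();
        $elements = array();
        foreach ($matches as $match) {
            $elements[] = dom_import_simplexml($match);
        }
        return $elements;
    }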

Another alternative is to use a DOMNodeListIterator over a DOMNodeList, such as the one returned by DOMDocument::getElementsByTagName(). However, that iteration is slow. A minimal sketch of the same alternative without any iterator class is shown below.
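Here the [@class='ckass'] predicate has to be checked by hand, since getElementsByTagName() offers no xpath filtering:

    foreach ($this->dom->getElementsByTagName('span') as $span) {
        // Manual stand-in for the [@class='ckass'] predicate.
        if ($span->getAttribute('class') === 'ckass') {
            // ... process $span ...
        }
    }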

Hope this is helpful even though the question is really old. It helped me in a similar situation.


Calling the garbage collector to collect cycles only makes sense if the objects are no longer referenced (no longer in use).

For example, if you create a new DOMXPath object for the same DOMDocument again and again (remember that it stays connected to a DOMDocument that still exists), it can look as if your memory is "leaking": you just use more and more of it.

Instead, you can simply reuse an existing DOMXPath object, just as you are already reusing the DOMDocument object the whole time. Try:

    //Loaded outside of loop
    $this->dom = new DOMDocument();
    $xpath = new DOMXPath($this->dom);

    //Inside Loop
    $this->dom->loadHTML($output);
    $nodes = $xpath->query("//span[@class='ckass']");
+1

If you use libxml_use_internal_errors(true); that can be the cause of a memory leak, because the internal list of errors keeps growing.

Use libxml_clear_errors(); or check this answer for details.
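A minimal sketch of where that call belongs in a loop like the one above ($pages again being a hypothetical array of HTML strings):

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    foreach ($pages as $output) {
        $dom->loadHTML($output);
        // ... run the xpath queries ...
        libxml_clear_errors(); // frees the error list accumulated by this parse
    }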

0
