
DOMDocument - determining the position of <br />

I am using DOMDocument to parse some HTML and want to replace line breaks (<br />) with \n. However, I am having trouble determining where in the document each break actually occurs.

Given the following HTML snippet - from a much larger file that I am reading using $dom->loadHTMLFile($pFilename):

<p>Multiple-line paragraph<br />that has a close tag</p> 

and my code is:

    foreach ($dom->getElementsByTagName('*') as $domElement) {
        switch (strtolower($domElement->nodeName)) {
            case 'p':
                $str = (string) $domElement->nodeValue;
                echo 'PARAGRAPH: ', $str, PHP_EOL;
                break;
            case 'br':
                echo 'BREAK: ', PHP_EOL;
                break;
        }
    }

I get:

    PARAGRAPH: Multiple-line paragraphthat has a close tag
    BREAK: 

How can I determine the position of this break within the paragraph and replace it with \n?

Or is there a better alternative than using domDocument to parse HTML, which may or may not be well-formed?

3 answers

You cannot get the position of an element using getElementsByTagName. You must walk through the childNodes of each element and process the text nodes and the elements separately.

In general, you will need recursion, for example:

    function processElement(DOMNode $element) {
        foreach ($element->childNodes as $child) {
            if ($child instanceof DOMText) {
                echo $child->nodeValue, PHP_EOL;
            } elseif ($child instanceof DOMElement) {
                switch ($child->nodeName) {
                    case 'br':
                        echo 'BREAK: ', PHP_EOL;
                        break;
                    case 'p':
                        echo 'PARAGRAPH: ', PHP_EOL;
                        processElement($child);
                        echo 'END OF PARAGRAPH;', PHP_EOL;
                        break;
                    // etc.
                    // other cases:
                    default:
                        processElement($child);
                }
            }
        }
    }

    $D = new DOMDocument;
    $D->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
    processElement($D);

This will output:

    PARAGRAPH: 
    Multiple-line paragraph
    BREAK: 
    that has a close tag
    END OF PARAGRAPH;
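To get the paragraph text with the breaks already converted, the same walk can return a string instead of echoing. This is only a minimal sketch, assuming br is the only element that needs special handling; textWithBreaks is a hypothetical helper name, not part of the DOM extension:

```php
<?php
// Hypothetical helper: returns the text of $node with every <br> replaced by "\n".
function textWithBreaks(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Plain text: keep as-is
            $text .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            // <br> becomes a newline; any other element is walked recursively
            $text .= strtolower($child->nodeName) === 'br'
                ? "\n"
                : textWithBreaks($child);
        }
    }
    return $text;
}

$dom = new DOMDocument;
$dom->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
foreach ($dom->getElementsByTagName('p') as $p) {
    echo 'PARAGRAPH: ', textWithBreaks($p), PHP_EOL;
}
```

Because the newline is appended at the point where the br child is encountered, its position relative to the surrounding text nodes is preserved.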

If you do not need to deal with child nodes or anything else, why not just replace the br tags directly?

    $str = '<p>Multiple-line paragraph<br />that has<br>a close tag</p>';
    echo preg_replace('/<br\s*\/?>/', "\n", $str);

output:

    <p>Multiple-line paragraph
    that has
    a close tag</p>
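Note that the pattern above is case-sensitive and does not allow attributes, so <BR> or <br class="x"> would be left in place. A slightly more tolerant variant might look like the sketch below; the sample input and the $result variable are assumptions made for illustration, and a regex like this can still be fooled by a literal > inside an attribute value:

```php
<?php
// Assumption: breaks may be written in any case and may carry attributes.
$str = '<p>Multiple-line<BR>paragraph<br class="x" />that has<br>a close tag</p>';

// \b keeps the pattern from matching other tags that merely start with "br";
// [^>]* skips optional attributes and the optional self-closing slash;
// the i modifier makes the match case-insensitive.
$result = preg_replace('~<br\b[^>]*>~i', "\n", $str);

echo $result;
```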

Alternative (using Dom):

    $str = '<p>Multiple-line<BR>paragraph<br />that<BR/>has<br>a close<Br>tag</p>';

    $dom = new DOMDocument();
    $dom->loadHTML($str);

    // using xpath here, because it will find every br-tag regardless
    // of it being self-closing or not
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//br') as $br) {
        $br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
    }

    // output whole html
    echo $dom->saveHTML();

    // or just the body child-nodes
    $output = '';
    foreach ($xpath->query('//body/*') as $bodyChild) {
        $output .= $dom->saveXML($bodyChild);
    }
    echo $output;
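Since the question mentions HTML that may not be well-formed: loadHTML accepts a lot of broken markup, but reports the problems it finds as PHP warnings. A hedged sketch of collecting those messages through libxml_use_internal_errors instead; the unclosed <p> in the input is only an illustrative assumption, and whether a given defect is reported at all depends on the libxml2 version:

```php
<?php
// Illustrative input: upper-case tag and a paragraph that is never closed.
$html = '<p>Multiple-line<BR>paragraph<br>that has no close tag';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$errors = libxml_get_errors();   // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Copy the live node list to an array before modifying the tree.
$breaks = iterator_to_array($dom->getElementsByTagName('br'));
foreach ($breaks as $br) {
    $br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
}

$p = $dom->getElementsByTagName('p')->item(0);
echo $p->textContent;
```

The node list returned by getElementsByTagName is live, which is why it is copied first here; an XPath query result does not have that problem.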

I wrote a simple class that does not use recursion and should be faster and consume less memory. It is basically the same simple idea as @Hrant Khachatrian's answer (iterate over all elements and look at their child nodes):

    class DomScParser
    {
        // Returns an array of all childless nodes named $tag_name below $parent_node.
        public static function find(DOMNode $parent_node, $tag_name)
        {
            // The parent itself is a self-contained (childless) node.
            // Always return an array so that callers can iterate over the result.
            if (!$parent_node->childNodes->length) {
                return $parent_node->nodeName == $tag_name ? array($parent_node) : array();
            }

            // Path from the parent down to the node currently being visited
            $dom_path = array($parent_node->firstChild);
            // Nodes found so far
            $found_dom_arr = array();

            // Iterate while we have elements in the path
            while ($dom_path_size = count($dom_path)) {
                // Get the last element in the path
                $current_node = end($dom_path);

                // No more siblings at this level: step back in the path
                if (!$current_node) {
                    array_pop($dom_path);
                    continue;
                }

                if ($current_node->firstChild) {
                    // A node with children is not self-contained, so it cannot
                    // be a match: descend into its first child.
                    $dom_path[] = $current_node->firstChild;
                } elseif ($current_node->nodeName == $tag_name) {
                    $found_dom_arr[] = $current_node;
                }

                // In both cases, point this path entry at the next sibling
                $dom_path[$dom_path_size - 1] = $current_node->nextSibling;
            }

            return $found_dom_arr;
        }

        public static function replace(DOMNode $parent_node, $search_tag_name, $replace_tag)
        {
            // Plain text was passed instead of a node: wrap it in a text node
            if (!$replace_tag instanceof DOMNode) {
                $dom = $parent_node instanceof DOMDocument
                    ? $parent_node
                    : $parent_node->ownerDocument;
                $replace_tag = $dom->createTextNode($replace_tag);
            }

            foreach (self::find($parent_node, $search_tag_name) as $found_tag) {
                $found_tag->parentNode->replaceChild($replace_tag->cloneNode(), $found_tag);
            }
        }
    }

    $D = new DOMDocument;
    $D->loadHTML('<span>test1<br />test2</span>');
    DomScParser::replace($D, 'br', "\n");

P.S. It also does not break on deeply nested tags, since it does not use recursion. HTML example:

 $html=str_repeat('<b>',100).'<br />'.str_repeat('</b>',100); 

Source: https://habr.com/ru/post/1387871/

