PHP DOMXPath removes my tags inside matched text

I asked this question yesterday (Parse HTML with HTML DOMDocument PHP), and at the time the answer was exactly what I needed. While working with some live data, though, I found that the result was not quite what I expected.

It gets the data from the HTML page, but it also strips all the HTML tags inside the captured block of text, which I don't want. (I may want to remove some tags, but not all, and that can be done later.)

2 answers

This is a common problem with the DOM: you need to do a bit more work if you want to get the contents of a tag together with the markup of all its children.

Basically, you need to iterate over the child nodes of the node matched by your XPath query and serialize each of them.

One solution is suggested in a user note on the DOMElement class manual page; see this note.


Integrating this solution into the code you already have should give you something like the following. First, declare an HTML string with nested tags:

    $html = <<<HTML
    <div class="main">
        <div class="text">
            <p>
                Capture this <strong>text</strong> <em>1</em>
            </p>
            <p>
                And some other <strong>text</strong>
            </p>
        </div>
    </div>
    HTML;


And, to extract data from this HTML string, you can use something like this:

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
    foreach ($tags as $tag) {
        $innerHTML = '';

        // see http://fr.php.net/manual/en/class.domelement.php#86803
        $children = $tag->childNodes;
        foreach ($children as $child) {
            // Import each child into a temporary document and serialize it
            $tmp_doc = new DOMDocument();
            $tmp_doc->appendChild($tmp_doc->importNode($child, true));
            $innerHTML .= $tmp_doc->saveHTML();
        }

        var_dump(trim($innerHTML));
    }

The only thing that has changed is the body of the foreach: instead of using $tag->nodeValue, you need to iterate over the children.


This gives me the following result:

 string '<p> Capture this <strong>text</strong> <em>1</em> </p> <p> And some other <strong>text</strong> </p>' (length=150) 

This is the full content of the matched <div> tag and all its children, including tags.
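For contrast, here is a minimal sketch of why $tag->nodeValue loses the markup while serializing the children keeps it. The one-line sample string is a hypothetical, compacted version of the HTML above, and the sketch assumes PHP 5.3.6 or later so that saveHTML() accepts a node argument.

```php
<?php
// Hypothetical compact sample, not the exact HTML from the question.
$html = '<div class="text"><p>Capture this <strong>text</strong></p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div[@class="text"]')->item(0);

// nodeValue concatenates the text of all descendants, so the tags are gone.
$textOnly = $tag->nodeValue;

// Serializing each child node keeps the <p> and <strong> tags.
$withTags = '';
foreach ($tag->childNodes as $child) {
    $withTags .= $dom->saveHTML($child);
}

var_dump($textOnly); // string(17) "Capture this text"
var_dump($withTags);
```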


Note: the user notes in the manual often contain interesting ideas and solutions ;-)


Pascal MARTIN's answer is great, but I found that it can be simplified:

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
    foreach ($tags as $tag) {
        $innerHTML = '';

        $children = $tag->childNodes;
        foreach ($children as $child) {
            // Serialize the child directly from the original document
            $innerHTML .= $dom->saveHTML($child);
        }

        var_dump(trim($innerHTML));
    }

This method seems to give the same result, but does not require creating new DOMDocument objects inside the foreach.
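If you need this in more than one place, the loop can be wrapped in a small function. This is only a sketch: innerHtml() is a hypothetical helper name, not part of the DOM extension, and it assumes PHP 7 or later (for the type declarations) and a node that belongs to a document, so that ownerDocument is not null.

```php
<?php
/**
 * Hypothetical helper: returns the serialized children of $node,
 * i.e. its "inner HTML". Assumes $node is attached to a DOMDocument.
 */
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument requires PHP 5.3.6 or later.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Usage with a hypothetical compact sample string.
$dom = new DOMDocument();
$dom->loadHTML('<div class="text"><p>And some other <strong>text</strong></p></div>');
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@class="text"]') as $tag) {
    var_dump(trim(innerHtml($tag)));
}
```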

EDIT:

So, after further experimentation, you can actually reduce this even further. Note that serializing $tag itself also includes the matched <div class="text"> element, not only its children:

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
    foreach ($tags as $tag) {
        var_dump(trim($dom->saveHTML($tag)));
    }
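One thing worth checking before you rely on the shortest version: passing the matched node to saveHTML() serializes the element together with its own opening and closing tags. A quick sketch to verify that, using a hypothetical compact sample string:

```php
<?php
// Hypothetical compact sample, not the exact HTML from the question.
$html = '<div class="text"><p>And some other <strong>text</strong></p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div[@class="text"]')->item(0);

$outer = trim($dom->saveHTML($tag));

// The serialized string starts with the matched element's own opening tag.
var_dump(strpos($outer, '<div class="text">') === 0); // bool(true)

// The children are still present inside that wrapper.
var_dump(strpos($outer, '<strong>text</strong>') !== false);
```

If you only want the contents, the loop over childNodes is the version to keep.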

Source: https://habr.com/ru/post/1305968/

