PHP DOMNode: how to extract not only text but also HTML tags

Question

I am trying to write a script that scrapes a website to get the latest news updates. Unfortunately, I have run into a small problem that I can't seem to fix because of my limited knowledge of the DOM.

The page I'm trying to scrape is structured as follows:

    <table>
      <tr class="color1">
        <td>Author</td>
        <td>Content <a href="#">in HTML</a></td>
        <td>Date</td>
      </tr>
    </table>

I can get all the fields I need except the content. With $td->nodeValue I get the content as plain text, whereas I want it as HTML (the cell contains tags such as "blockquote", etc.).

Here is the code I have:

    try {
        $html = @file_get_contents("test.php");
        checkIfFileExists($html); // the asker's own helper, not shown here

        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        $trNodes = $dom->getElementsByTagName("tr");
        foreach ($trNodes as $tr) {
            // Only the news rows carry the class "color1" or "color2"
            if ($tr->getAttribute("class") == "color1" || $tr->getAttribute("class") == "color2") {
                $tdNodes = $tr->childNodes;
                foreach ($tdNodes as $td) {
                    // nodeValue returns the text content only, without tags
                    echo $td->nodeValue . "<br />\n";
                }
                echo "<br /><br /><br /><br /><br />\n";
            }
        }
    } catch (Exception $e) {
        echo $e->getMessage();
    }

I would prefer not to resort to a third-party library, but any answer is welcome, library or not.

Thanks in advance.

dom php screen-scraping

Steven Jun 07 '11 at 7:36

1 answer

Frederic bazin · Accepted Answer · 2011-06-07T07:43:42+0000

Replace

 echo $td->nodeValue . "<br />\n";

with

 echo $dom->saveXML($td) . "<br />\n";
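
To illustrate the difference, here is a minimal, self-contained sketch that runs against the sample row from the question. It is not part of the original answer. It uses saveHTML() with a node argument (available since PHP 5.3.6) as an HTML-flavoured alternative to saveXML(). The innerHtml() function is a hypothetical helper introduced only for this example, for the case where the wrapping <td> tag is not wanted.

```php
<?php
// Sample markup taken from the question.
$html = '<table><tr class="color1"><td>Author</td>'
      . '<td>Content <a href="#">in HTML</a></td><td>Date</td></tr></table>';

// Collect parser warnings internally instead of silencing them with "@".
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

/**
 * Hypothetical helper (not part of the DOM extension): serializes the
 * children of a node, i.e. its markup without the wrapping tag.
 */
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() accepts a node argument since PHP 5.3.6.
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

// Second <td> of the sample row: the "content" cell.
$contentCell = $dom->getElementsByTagName('td')->item(1);

$text  = $contentCell->nodeValue;      // text only, tags are dropped
$outer = $dom->saveHTML($contentCell); // markup including the <td> wrapper
$inner = innerHtml($contentCell);      // markup of the cell's children only

echo $text, "\n", $outer, "\n", $inner, "\n";
```

saveXML($td) and saveHTML($td) both keep the tags. The main difference is the serialization syntax: saveXML() writes empty elements in XML form (for example <br/>), while saveHTML() writes them in HTML form.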
