Preserving line breaks inside <p> tags with DOMXPath?
I am currently using PHP and DOMXPath to retrieve the contents of all <p> elements of a web page:
    <?php
    ...
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $paragraphs = $xpath->evaluate("/html/body//p");
    foreach ($paragraphs as $paragraph) {
        echo $paragraph->textContent . "<br />";
    }

My problem is that the string returned by textContent does not preserve the <br /> tags that exist inside these <p> elements. Instead, it drops the line breaks and runs together words that would normally be on separate lines. For instance:
HTML example:
    <p>
    Some happy talk goes here talking about our great product.<br />
    We would love for you to buy it!
    </p>
    <p>
    Random information and what not<br />
    Isn't that cool?
    </p>

Current output from PHP above:
    Some happy talk about our great product.We would love for you to buy it!
    Random information and what notIsn't that cool?

I also tried $paragraphs = $doc->getElementsByTagName("p"); and it gives me the same result.
Is there a way to force DOMXPath / DOMDocument to keep line breaks? I need to be able to separate each of the words in a paragraph, and the current output makes this impossible.
If there is an alternative method for extracting the string inside <p> elements while preserving <br /> or '\n', that would also be great.
EDIT
After further investigation, the HTML in question is actually a list of anchors separated by <br> tags, but without actual line breaks:
    <p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

It turns out that the original code works correctly when the HTML source contains actual line breaks.
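To illustrate the difference, here is a minimal sketch, assuming a made-up two-paragraph document (the `$html` string below is hypothetical, not the real page). The first paragraph has a literal newline after its <br />, the second does not. textContent keeps text nodes only, so the <br /> element itself contributes nothing in either case.

```php
<?php
// Hypothetical sample markup: paragraph 1 has a real newline after <br />,
// paragraph 2 has only the <br /> element with no newline in the source.
$html = "<html><body><p>one<br />\ntwo</p><p>three<br />four</p></body></html>";

$doc = new DOMDocument();
$doc->loadHTML($html);

$texts = array();
foreach ($doc->getElementsByTagName('p') as $p) {
    // textContent concatenates the text nodes; elements such as <br /> add no text.
    $texts[] = $p->textContent;
}

var_dump($texts[0]); // still contains the newline that was in the source
var_dump($texts[1]); // the two words are run together
```

So the words stay separable only when the source happens to have whitespace next to the <br /> tag.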
UPDATE: Solved
With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, I got this resolved. It is not very elegant, but it will do for now.
Example:
    foreach ($paragraphs as $paragraph) {
        // DOMinnerHTML() is a user-defined helper (not built into PHP) that returns the element's inner markup.
        $p_text = new DOMDocument();
        $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
        // Do whatever; in this case, get all of the words in an array.
        $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
        print_r($words);
    }

This uses DOMinnerHTML (as suggested by @netcoder) to get each paragraph's markup and replaces <br> instances with "\r\n" (as suggested by @ircmaxell), so that the line breaks survive when the text is then read back through textContent.
Obviously there is room for improvement, but it solved my current problem.
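DOMinnerHTML is not a built-in PHP function, and its body is not shown above. A minimal sketch of what such a helper might look like, assuming it simply serializes each child node with DOMDocument::saveHTML (the sample markup at the bottom is made up for illustration):

```php
<?php
// Assumed implementation of the helper used above: returns the markup of an
// element's children as a string (the element's own tags are not included).
function DOMinnerHTML(DOMNode $element)
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML() accepts a node argument and serializes just that node.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Hypothetical usage with a small fragment.
$doc = new DOMDocument();
$doc->loadHTML('<p>Classic Checking<br>Interest Checking</p>');
$inner = DOMinnerHTML($doc->getElementsByTagName('p')->item(0));
echo $inner;
```

Because the <br> tags survive in the returned string, they can then be swapped for "\r\n" with str_ireplace as in the loop above.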
Thanks for the help, everyone,
Ben
Well, what I would do is replace the <br> elements with literal line breaks:
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    // Copy the live node list first; replacing nodes while iterating it directly can skip elements.
    $brs = iterator_to_array($doc->getElementsByTagName('br'));
    foreach ($brs as $node) {
        $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
    }
    $xpath = new DOMXPath($doc);
    $paragraphs = $xpath->evaluate("/html/body//p");
    foreach ($paragraphs as $paragraph) {
        echo $paragraph->textContent . "<br />";
    }
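Since the goal in the question is to separate the words of each paragraph, the same idea can be taken one step further. This is only a sketch: the function name `paragraphWords` and the sample markup are made up for illustration, and the delimiter set is an assumption based on the characters used in the question's update.

```php
<?php
// Hypothetical helper: returns one array of words per <p> element.
function paragraphWords($html)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Snapshot the live node list so replacing nodes does not disturb the loop.
    foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    $result = array();
    foreach ($doc->getElementsByTagName('p') as $p) {
        // Split on whitespace and a few punctuation characters, dropping empty pieces.
        $result[] = preg_split('/[\s,.&:-]+/', $p->textContent, -1, PREG_SPLIT_NO_EMPTY);
    }
    return $result;
}

$words = paragraphWords('<p>Classic Checking<br>Interest Checking</p><p>CDs<br>IRAs</p>');
print_r($words);
```

Using preg_split with PREG_SPLIT_NO_EMPTY avoids the empty strings that explode(" ", ...) produces when several delimiters appear in a row.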