Preserving line breaks inside <p> tags with DOMXPath?

Question

I am currently using PHP and DOMXPath to retrieve the contents of all <p> elements of a web page:

    <?php
    ...
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $paragraphs = $xpath->evaluate("/html/body//p");
    foreach ($paragraphs as $paragraph) {
        echo $paragraph->textContent . "<br />";
    }

My problem is that the string returned by textContent does not honor the <br /> tags inside these <p> elements. Instead, it drops the line breaks and runs together words that are normally on separate lines. For instance:

HTML example:

    <p>
        Some happy talk goes here talking about our great product.<br />
        We would love for you to buy it!
    </p>
    <p>
        Random information and what not<br />
        Isn't that cool?
    </p>

Current output from PHP above:

    Some happy talk goes here talking about our great product.We would love for you to buy it! Random information and what notIsn't that cool?

I also tried $paragraphs = $doc->getElementsByTagName("p"); and it gives me the same result.

Is there a way to force DOMXPath / DOMDocument to keep line breaks? I need to be able to separate each of the words in a paragraph, and the current output makes that impossible.

If there is an alternative method for extracting the text inside <p> elements while preserving <br /> or '\n', that would also be great.

EDIT

After further investigation, the HTML in question is actually a list of anchors separated by <br> tags, but without actual line breaks:

 <p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

It turns out that this works correctly with the HTML source code.

UPDATE: Solved

With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, I was able to resolve this. It is not very elegant, but it will do for now.

Example:

    foreach ($paragraphs as $paragraph) {
        $p_text = new DOMDocument();
        $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
        // Do whatever; in this case, get all of the words in an array.
        $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
        print_r($words);
    }

This uses DOMinnerHTML (as suggested by @netcoder) to get each paragraph's markup and replaces the <br> instances with "\r\n" (as suggested by @ircmaxell), so the breaks survive when the result is read back through textContent.
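DOMinnerHTML is not a built-in PHP function, and the question does not show its definition. A minimal sketch of what such a helper might look like, assuming it simply serializes each child node of the element with DOMDocument::saveHTML(), is:

```php
<?php
// Hypothetical helper: the question calls DOMinnerHTML() without defining it.
// This version concatenates the serialized markup of every child node.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // Passing a node to saveHTML() serializes only that node and its subtree.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Sample markup modelled on the anchor list in the question (shortened URLs).
$doc = new DOMDocument();
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br><a href="/b.html">Savings Plans</a></p>');
$p = $doc->getElementsByTagName('p')->item(0);

// The <br> is still present as markup, so it can be swapped for "\r\n".
echo DOMinnerHTML($p), "\n";
```

Because the markup is re-serialized by the parser, the exact spelling of the tag in the output may differ from the input, which is why the solution above searches for more than one variant.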

Obviously there is room for improvement, but it solved my current problem.

Thanks for the help, everyone,

Ben

+4

dom html php xpath

Ben L. Jan 19 '11 at 19:44

3 answers

One possibility:

 echo simplexml_import_dom($paragraph)->asXML();
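To illustrate what this gives you: asXML() returns the paragraph's markup rather than its text, so the <br> survives as a tag that can be replaced afterwards. The sketch below assumes the goal is plain text with newlines; the preg_replace() and strip_tags() steps are additions for illustration and are not part of the answer.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p>Random information and what not<br />Isn\'t that cool?</p>');
$paragraph = $doc->getElementsByTagName('p')->item(0);

// asXML() serializes the element, so the line break is kept as a <br/> tag.
$markup = simplexml_import_dom($paragraph)->asXML();

// Assumption: we want text only, so turn each <br> into "\n" and drop other tags.
$text = strip_tags(preg_replace('#<br\s*/?>#i', "\n", $markup));

echo $text, "\n";
```

Note that asXML() escapes characters such as & as entities, so text containing them would need html_entity_decode() as a further step.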

+2

ajreal Jan 19 '11 at 20:18

I had the same situation. I use:

 $document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));

Afterwards I use urldecode() to turn it back into <br> for display or before inserting it into the database.
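A sketch of the full round trip may make this clearer. It assumes the encoded tag is used purely as a placeholder that the parser treats as ordinary text, and it splits on that placeholder instead of decoding the whole string, so that other percent sequences in the text are left alone. Like the one-liner above, it only handles the exact spelling <br>.

```php
<?php
$html = '<p>Random information and what not<br>Isn\'t that cool?</p>';

// urlencode('<br>') yields '%3Cbr%3E', which contains no markup characters.
$placeholder = urlencode('<br>');

$document = new DOMDocument();
$document->loadHTML(str_replace('<br>', $placeholder, $html));

// The placeholder is now part of the paragraph's text content.
$text = $document->getElementsByTagName('p')->item(0)->textContent;

// Assumption: we want one array entry per original line.
$lines = explode($placeholder, $text);
print_r($lines);
```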

+1

inMILD Jul 25 '13 at 1:58

ircmaxell · Accepted Answer · 2011-01-19T20:19:20+0000

Well, what I would do is replace the <br> elements with literal line breaks:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    // getElementsByTagName() returns a live list, so copy it before
    // replacing nodes; otherwise every other <br> is skipped.
    $brs = iterator_to_array($doc->getElementsByTagName('br'));
    foreach ($brs as $node) {
        $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
    }
    $xpath = new DOMXPath($doc);
    $paragraphs = $xpath->evaluate("/html/body//p");
    foreach ($paragraphs as $paragraph) {
        echo $paragraph->textContent . "<br />";
    }
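Since the stated goal in the question is to separate the words of each paragraph, the same idea can be wrapped in a small function. This is only a sketch: paragraphWords is a hypothetical name, it inserts "\n" rather than "\r\n", and it splits on runs of whitespace with preg_split(), which is an assumption about what counts as a word.

```php
<?php
// Hypothetical helper: returns one array of words per <p> element in $html.
function paragraphWords(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Work on a copy of the node list, since the tree is modified in the loop.
    foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('/html/body//p') as $paragraph) {
        // Split on any run of whitespace, including the inserted newlines.
        $result[] = preg_split('/\s+/', $paragraph->textContent, -1, PREG_SPLIT_NO_EMPTY);
    }
    return $result;
}

print_r(paragraphWords('<p>Random information and what not<br />Isn\'t that cool?</p>'));
```

With the text node in place, "not" and "Isn't" come out as separate words instead of being fused into "notIsn't".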
