How can I preserve line breaks inside <p> tags with DOMXPath?

I am currently using PHP and DOMXPath to retrieve the contents of all <p> elements of a web page:

    <?php
    ...
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $paragraphs = $xpath->evaluate("/html/body//p");

    foreach ($paragraphs as $paragraph) {
        echo $paragraph->textContent . "<br />";
    }

My problem is that the string returned by textContent does not preserve the <br /> tags that exist inside these <p> elements. Instead, it drops the line breaks and runs together words that would normally be on separate lines. For instance:

HTML example:

    <p>
        Some happy talk goes here talking about our great product.<br />
        We would love for you to buy it!
    </p>
    <p>
        Random information and what not<br />
        Isn't that cool?
    </p>

Current output from PHP above:

    Some happy talk goes here talking about our great product.We would love for you to buy it!
    Random information and what notIsn't that cool?

I also tried $paragraphs = $doc->getElementsByTagName("p"); and it gives me the same result.

Is there a way to force DOMXPath / DOMDocument to keep line breaks? I need to be able to separate each of the words in a paragraph, and the current output prevents this.

If there is an alternative method for extracting the string inside <p> elements while preserving <br /> or '\n', that would also be great.

EDIT


After further investigation, the HTML in question is actually a list of anchors separated by <br> tags, but without actual line breaks in the source:

 <p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p> 

It turns out that the code above works correctly when the HTML source itself contains line breaks, as in the first example.

UPDATE: Solved


With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this is now resolved. The solution is not very elegant, but it will do for now.

Example:

    foreach ($paragraphs as $paragraph) {
        $p_text = new DOMDocument();
        $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));

        // Do whatever; in this case, get all of the words in an array.
        $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
        print_r($words);
    }

This uses DOMinnerHTML (as suggested by @netcoder) to get each paragraph's markup and replaces <br> instances with "\r\n" (as suggested by @ircmaxell), so the line breaks survive when textContent is read afterwards.
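The snippet above calls a DOMinnerHTML() helper whose body is not shown in the post. A minimal sketch of what such a helper might look like, assuming it simply serializes each child node of the element with DOMDocument::saveHTML() (the sample markup and hrefs are placeholders):

```php
<?php
// Hypothetical helper: the post uses DOMinnerHTML() without showing its body.
// This sketch assumes it returns the serialized markup of the element's children.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes only the given node.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

$doc = new DOMDocument();
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br><a href="/b.html">Interest Checking</a></p>');

$p = $doc->getElementsByTagName('p')->item(0);
$inner = DOMinnerHTML($p);
echo $inner, "\n";
```

Because the returned string still contains the <br> tags, it can be passed to str_ireplace() as in the loop above.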

Obviously there is room for improvement, but it solved my current problem.

Thanks for the help, everyone,

Ben

3 answers

Well, what I would do is replace the <br> elements with literal line breaks:

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // getElementsByTagName() returns a live list, so copy it to an array
    // before replacing nodes; otherwise some <br> elements are skipped.
    $brs = iterator_to_array($doc->getElementsByTagName('br'));
    foreach ($brs as $node) {
        $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
    }

    $xpath = new DOMXPath($doc);
    $paragraphs = $xpath->evaluate("/html/body//p");
    foreach ($paragraphs as $paragraph) {
        echo $paragraph->textContent . "<br />";
    }
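To try this approach against an anchor list like the one in the question, here is a self-contained sketch. The shortened sample markup, the "#" hrefs, and the use of preg_split() to pull out the words are assumptions for illustration, not part of the answer:

```php
<?php
// Shortened stand-in for the markup in the question (hrefs are placeholders).
$html = '<p class="home_page_list"><a href="#">Classic Checking</a><br>'
      . '<a href="#">Interest Checking</a><br><a href="#">CDs</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the live node list into an array before modifying the tree.
foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
    $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
}

$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// Split on any run of whitespace, dropping empty pieces.
$words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($words);
```

With the <br> elements turned into newlines, the link texts no longer run together, so splitting on whitespace yields the individual words.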

One possibility:

 echo simplexml_import_dom($paragraph)->asXML(); 
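A short sketch of how that could be used on one of the sample paragraphs. Since asXML() keeps the child tags, the <br> can be swapped for a newline afterwards; the regular expression, strip_tags() step, and variable names are assumptions added for illustration:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p>Random information and what not<br />Isn\'t that cool?</p>');
$paragraph = $doc->getElementsByTagName('p')->item(0);

// asXML() serializes the element with its child tags intact.
$xml = simplexml_import_dom($paragraph)->asXML();

// Swap the serialized <br> for a newline, then drop the remaining tags.
$text = trim(strip_tags(preg_replace('/<br\s*\/?>/i', "\n", $xml)));
echo $text, "\n";
```

Note that strip_tags() does not decode entities, so text containing characters such as & may need html_entity_decode() as a further step.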

I had the same situation. I use:

 $document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file)); 

I then use urldecode() to change it back for display or before inserting it into the database.
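A sketch of that round trip, assuming the encoded placeholder is later swapped for a newline rather than decoded back to a tag. The sample string and variable names are made up for illustration:

```php
<?php
$html = '<p>Random information and what not<br>Isn\'t that cool?</p>';

// urlencode('<br>') is '%3Cbr%3E', which the parser treats as plain text.
$placeholder = urlencode('<br>');

$document = new DOMDocument();
$document->loadHTML(str_replace('<br>', $placeholder, $html));

$text = $document->getElementsByTagName('p')->item(0)->textContent;

// The placeholder survives textContent; replace it with a real newline.
$text = str_replace($placeholder, "\n", $text);
echo $text, "\n";
```

This only catches the exact string '<br>'; variants such as '<br />' or '<BR>' would need str_ireplace() with an array of spellings, as in the question's update.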


Source: https://habr.com/ru/post/1336201/

