nodeValue from DOMDocument returns weird characters in PHP

So, I'm trying to parse HTML pages and find the paragraphs (<p>) with getElementsByTagName('p').

The problem is that $element->nodeValue returns strange characters. The document is first fetched into $html using curl and then loaded into DOMDocument.

I am sure that this is due to encodings.

Here is an example response: "aujourdâ€™hui".

Thanks in advance.

3 answers

I fixed this by forcing a conversion to UTF-8, even though the source was already UTF-8:

    $text = iconv("UTF-8", "UTF-8", $text);

    // SmartDOMDocument is a third-party DOMDocument subclass, not part of PHP core.
    $dom = new SmartDOMDocument();
    $dom->loadHTML($webpage, 'UTF-8');
    // ...
    echo $node->nodeValue;

PHP is weird :)
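Converting UTF-8 to UTF-8 only helps when the input really is UTF-8. If the fetched page might use another encoding, a conditional conversion is one option. This is a minimal sketch, assuming the only alternative encoding is ISO-8859-1; the helper name `to_utf8` and the fallback encoding are illustrative assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: return $body as UTF-8, converting from $fallback
// only when the bytes are not already valid UTF-8.
function to_utf8(string $body, string $fallback = 'ISO-8859-1'): string
{
    // With the /u modifier, preg_match() returns false on invalid UTF-8.
    if (preg_match('//u', $body) === 1) {
        return $body;
    }
    $converted = iconv($fallback, 'UTF-8', $body);
    // Keep the original bytes if the conversion itself fails.
    return $converted === false ? $body : $converted;
}

// "caf\xE9" is "café" in ISO-8859-1 and is not valid UTF-8.
echo bin2hex(to_utf8("caf\xE9")), "\n";
```

Input that is already valid UTF-8 passes through unchanged, so the helper can be applied to every fetched page before it is handed to the parser.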


I had the same problem and then noticed that loadHTML() no longer accepts two parameters, so I had to find another solution. Using the following function in my DOM library, I was able to get rid of the funky characters in my HTML content.

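    private static function load_html($html) {
        $doc = new DOMDocument;
        // Prefix an XML declaration so the parser reads the input as UTF-8.
        $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
        // Remove the declaration node again; iterate over a copy so that
        // removing a child does not disturb the loop.
        foreach (iterator_to_array($doc->childNodes) as $node) {
            if ($node->nodeType == XML_PI_NODE) {
                $doc->removeChild($node);
            }
        }
        $doc->encoding = 'UTF-8';
        return $doc;
    }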
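As a usage sketch, the same technique can be tried outside a class. The standalone function name `load_html_utf8` and the sample string are assumptions for illustration. The body follows the function above, with the child nodes copied to an array before removal so that the loop is not disturbed.

```php
<?php
// Hypothetical standalone variant of the load_html() method above.
function load_html_utf8(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Prefix an XML declaration so the parser reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    // Drop the declaration node again so it does not show up in output.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }
    $doc->encoding = 'UTF-8';
    return $doc;
}

// Sample input with a typographic apostrophe (U+2019), as in the question.
$doc = load_html_utf8("<p>aujourd\u{2019}hui</p>");

$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = $p->nodeValue;
}
echo implode("\n", $paragraphs), "\n";
```

The element lookup uses getElementsByTagName('p') and nodeValue, the same calls as in the question, so only the loading step changes.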

This is an encoding problem. Try explicitly setting the encoding to UTF-8.

This should help: http://devzone.zend.com/article/8855
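One way to set the encoding explicitly is to prepend a charset declaration to the markup before loading it. This is a rough sketch rather than the method from the linked article; the sample string and variable names are assumptions for illustration.

```php
<?php
// Sample input: UTF-8 bytes that carry no charset declaration of their own.
$html = "<p>aujourd\u{2019}hui</p>";

// Without a declaration, nodeValue may come back garbled on many PHP builds.
$plain = new DOMDocument();
$plain->loadHTML($html);
$undeclared = $plain->getElementsByTagName('p')->item(0)->nodeValue;

// With an explicit charset declaration prepended to the markup.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$fixed = new DOMDocument();
$fixed->loadHTML($meta . $html);
$declared = $fixed->getElementsByTagName('p')->item(0)->nodeValue;

echo $declared, "\n";
// Shows whether the undeclared load produced the same text on this build.
var_dump($undeclared === $declared);
```

The declaration only tells the parser how to read the bytes. If the fetched page is not actually UTF-8, it has to be converted first.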


Source: https://habr.com/ru/post/1342189/

