Why is the DOM changing the encoding?

    $string = file_get_contents('http://example.com');
    if ('UTF-8' === mb_detect_encoding($string)) {
        $dom = new DOMDocument();
        // hack to preserve UTF-8 characters
        $dom->loadHTML('<?xml encoding="UTF-8">' . $string);
        $dom->preserveWhiteSpace = false;
        $dom->encoding = 'UTF-8';
        $body = $dom->getElementsByTagName('body');
        echo htmlspecialchars($body->item(0)->nodeValue);
    }

This changes all UTF-8 characters to ร…, ยพ, ยค and other garbage. Is there any other way to preserve UTF-8 characters?

Please do not post answers telling me to make sure I output it as UTF-8; I have already made sure that I do.

Thank you in advance :)

+20
dom php utf-8
Feb 10 2018-10-10
4 answers

I had similar problems lately, and in the end I found this workaround: I convert all non-ASCII characters to HTML entities before loading the HTML.

    $string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");
    $dom->loadHTML($string);
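A fuller sketch of this workaround, written as a self-contained function. It assumes the input really is UTF-8. Two things here are not from the answer above: the helper name `bodyTextFromUtf8Html` is hypothetical, and `mb_encode_numericentity` is swapped in for `mb_convert_encoding($string, 'HTML-ENTITIES', ...)`, because the `HTML-ENTITIES` pseudo-encoding is deprecated as of PHP 8.2. The idea is the same: DOMDocument only ever sees ASCII plus numeric entities, so there are no raw multibyte sequences for it to misinterpret.

```php
<?php
// Hypothetical helper, for illustration only (not part of the original answer).
// Returns the text content of <body> from an HTML string assumed to be UTF-8.
function bodyTextFromUtf8Html(string $html): string
{
    // Convert every code point from U+0080 upward to a numeric entity (&#NNN;).
    // ASCII, including the markup itself, is left untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $asciiOnly = mb_encode_numericentity($html, $convmap, 'UTF-8');

    // Collect parser warnings about sloppy real-world markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($asciiOnly);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // DOM string properties come back as UTF-8, so the entities are decoded here.
    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : $body->textContent;
}

// Usage: escape the text for HTML output, as in the question.
echo htmlspecialchars(bodyTextFromUtf8Html('<html><body><p>Åäö ¾ ¤</p></body></html>'));
```

The page that prints the result still needs to be served as UTF-8, since the returned string contains raw UTF-8 bytes again.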
+38
Feb 10 2018-10-10

In case the DOM is mangling the encoding, this trick worked for me some time ago (with ISO-8859-1 input data). DOMDocument should be UTF-8 by default anyway, but you can still try:

  $dom = new DOMDocument('1.0', 'utf-8'); 
+4
Feb 10 2018-10-10

At the top of the script where your PHP code is located (the code you posted here), make sure you send the UTF-8 header. I bet your encoding is some variant of Latin-1 right now. Yes, I know that the remote web page is UTF-8, but this PHP script is not.

+1
Feb 10 2018-10-10

I needed to add the UTF-8 header to get the page to display correctly:

 header('Content-Type: text/html; charset=utf-8'); 
0
Jan 06 '18 at 19:12


