Why is the DOM changing the encoding?

Question

  $string = file_get_contents('http://example.com');

  if ('UTF-8' === mb_detect_encoding($string)) {
      $dom = new DOMDocument();
      // hack to preserve UTF-8 characters
      $dom->loadHTML('<?xml encoding="UTF-8">' . $string);
      $dom->preserveWhiteSpace = false;
      $dom->encoding = 'UTF-8';
      $body = $dom->getElementsByTagName('body');
      echo htmlspecialchars($body->item(0)->nodeValue);
  }

This changes all UTF-8 characters to Å, ¾, ¤ and other garbage. Is there any other way to preserve UTF-8 characters?

Please do not post answers telling me to make sure I output it as UTF-8; I have made sure that I do.

Thank you in advance :)

+20

dom php utf-8

Richard Knop Feb 10 2018-10-10

4 answers

If it really is the DOM mangling the encoding, this trick worked for me some time ago (when accepting ISO-8859-1 data). DOMDocument should use UTF-8 by default anyway, but you can still try:

  $dom = new DOMDocument('1.0', 'utf-8');

+4

Pekka 웃 Feb 10 2018-10-10

At the top of the script that contains your PHP code (the code you posted here), make sure you send the UTF-8 header. I bet your output encoding is some variant of latin1 right now. Yes, I know that the remote web page is UTF-8, but this PHP script is not.

+1

goat Feb 10 2018-10-10

I needed to add the UTF-8 header to get the page to display correctly:

 header('Content-Type: text/html; charset=utf-8');

0

fty4 Jan 06 '18 at 19:12

andrewmabbott · Accepted Answer · 2010-02-10 15:48

I had similar problems recently, and eventually found this workaround: convert all non-ASCII characters to HTML entities before loading the HTML.

 $string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");
 $dom->loadHTML($string);
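
For anyone on a newer PHP version: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2, so here is a minimal sketch of the same idea using mb_encode_numericentity() instead. The helper name bodyTextFromUtf8Html and the sample markup are made up for illustration, and it assumes the input really is valid UTF-8 and that the mbstring and dom extensions are loaded.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of
// <body> from a UTF-8 HTML string without the parser guessing the charset.
function bodyTextFromUtf8Html(string $html): string
{
    // Turn every code point >= 0x80 into a numeric entity (&#NNN;).
    // ASCII markup is left untouched, so the parser only ever sees ASCII.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

    // Collect libxml warnings internally instead of printing them;
    // real-world HTML is rarely well-formed.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();

    $body = $dom->getElementsByTagName('body')->item(0);

    // DOM strings come back as UTF-8, with the entities decoded again.
    return $body === null ? '' : $body->textContent;
}

// Sample input, assumed for illustration only.
$html = '<html><body><p>Žluťoučký kůň</p></body></html>';
echo bodyTextFromUtf8Html($html), PHP_EOL;
```

You still need to send the output as UTF-8 (and pass 'UTF-8' to htmlspecialchars() if you escape it), since the DOM hands the text back in UTF-8.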
