DOMDocument breaks the encoding?

I run the following code:

 $page = '<p>Ä</p>';
 $DOM = new DOMDocument;
 $DOM->loadHTML($page);
 echo 'source: ' . $page;
 echo 'dom: ' . $DOM->getElementsByTagName('p')->item(0)->textContent;

and it displays the following:

source: Ä

dom: Ã

So I don't understand why the encoding gets broken when the text passes through DOMDocument.

+4
php encoding domdocument
01 Oct
2 answers

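DOMDocument seems to treat the input as ISO-8859-1 when no charset is declared, and it returns text as UTF-8. In this conversion, each of the two UTF-8 bytes of Ä is read as a separate character, so Ä becomes Ã„. Here's the catch: the second character (byte 0x84) is not a printable character in ISO-8859-1, but it is „ in Windows-1252. That is why you do not see the second character in your output.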

You can fix this by calling utf8_decode() on the output of textContent, or by using UTF-8 as the declared character encoding of the page.
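The double encoding described above can be reproduced without DOMDocument at all. The following is a minimal sketch, assuming the mbstring extension is available; it uses mb_convert_encoding() as a stand-in for utf8_decode(), which is deprecated as of PHP 8.2. The variable names are illustrative only.

```php
<?php
// "Ä" written as raw UTF-8 bytes, so the result does not depend on
// the encoding of this source file.
$original = "\xC3\x84";

// Simulate the misreading: treat each byte as an ISO-8859-1 character
// and re-encode the result as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo it: convert UTF-8 back to ISO-8859-1, which is what
// utf8_decode() does.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n"; // two bytes
echo bin2hex($garbled), "\n";  // four bytes: each input byte became two
echo bin2hex($restored), "\n"; // back to the original two bytes
```

Converting back only works as long as every character fits in ISO-8859-1, so declaring the correct input encoding is usually the more robust option.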

+3
01 Oct. '12 at 16:17

Here is a workaround that declares the correct encoding through a meta tag prepended to the markup:

 $DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $page); 

I'm not sure whether UTF-8 is the character set you are actually using, so adjust the charset value if necessary.
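Put together, the workaround might look like the sketch below. It assumes the input fragment is UTF-8 encoded and that the dom extension is loaded; loadUtf8Fragment is a hypothetical helper name, not part of PHP.

```php
<?php
// Hypothetical helper: parses a UTF-8 HTML fragment, declaring the
// charset through a prepended meta tag so the parser does not guess.
function loadUtf8Fragment(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

$dom = loadUtf8Fragment("<p>\xC3\x84</p>"); // "Ä" as UTF-8 bytes
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

// Hex dump of the extracted text; if the declared charset is honored,
// it matches the two input bytes.
echo bin2hex($text), "\n";
```

Note that the prepended meta element stays in the parsed document, which matters if the document is later serialized with saveHTML().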

See also: domdocument character set issue

+8
Oct 01 '12 at 16:17


