I had a problem similar to the one in this question:
nodeValue from DomDocument returning weird characters in PHP
The root cause I found can be reproduced with mb_convert_encoding().
In my unit tests, this finally caught the problem:
$test = mb_convert_encoding('é', "UTF-8");
$this->assertTrue(mb_check_encoding($test, 'UTF-8'), 'data is UTF-8');
$this->assertTrue($this->rw->checkEncoding($test, 'UTF-8'), 'data is UTF-8');
$this->assertIdentical($test, html_entity_decode('é', ENT_QUOTES, 'UTF-8'), 'values match');
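To make the failure mode concrete: when no source encoding is passed, mb_convert_encoding() falls back to PHP's configured internal encoding. The sketch below assumes that fallback is ISO-8859-1 and passes it explicitly so the result is reproducible; the variable names are only illustrative.

```php
<?php
// "é" written as its two UTF-8 bytes, so the source file's own encoding does not matter.
$utf8 = "\xC3\xA9";

// Assumption: the input is (wrongly) treated as ISO-8859-1.
// Each of the two bytes is then re-encoded as a separate character.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($double), "\n"; // c383c2a9

// Both strings are valid UTF-8, so a validity check alone cannot tell them apart.
var_dump(mb_check_encoding($double, 'UTF-8')); // bool(true)
var_dump($double === $utf8);                   // bool(false)
```

This is why the first two assertions can pass while the identity assertion fails.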
The UTF-8 data seems to be getting mangled along the way, and the default code page of the system PHP is running on is most likely not UTF-8.
Right up to the end of parsing (with an HTML5lib implementation that falls back to DOMDocument), the strings stay clean and UTF-8 friendly. Only when reading data through
$span->nodeValue
do I see the encoding become inconsistent.
My hunch is that the entity handling DOMDocument applies when exporting to nodeValue uses an encoding converter but ignores the encoding declared inline.
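One way to probe this hunch is to load the same UTF-8 markup with and without an explicit charset declaration and compare the bytes that nodeValue returns. This is a minimal sketch that assumes the stock DOMDocument::loadHTML() rather than HTML5lib; the markup and variable names are made up for illustration.

```php
<?php
$span = "<span>\xC3\xA9</span>"; // "é" as UTF-8 bytes

// Without a charset declaration the parser has to guess the input encoding.
$bare = new DOMDocument();
$bare->loadHTML("<html><body>$span</body></html>");

// With a Content-Type meta tag the input is declared as UTF-8.
$declared = new DOMDocument();
$declared->loadHTML(
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
    . "<body>$span</body></html>"
);

// Dump the raw bytes of each span's text so the two results can be compared.
foreach (['bare' => $bare, 'declared' => $declared] as $label => $doc) {
    $value = $doc->getElementsByTagName('span')->item(0)->nodeValue;
    echo $label, ': ', bin2hex($value), "\n";
}
```

A clean é shows up as c3a9. If the two hex dumps differ, the parser guessed a non-UTF-8 input encoding for the undeclared document.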
Given that my problem is with HTML5, I thought it would be directly related to the novelty of the implementation, but it seems to be a broader problem. Apart from the question mentioned at the beginning, my searches turned up no information on this issue in connection with DOMDocument.
UPDATE
In the interest of moving forward, I switched from HTML5lib and DOMDocument to Simple HTML DOM, which exports cleanly escaped HTML that I can then parse back into correct UTF-8.
Also, one function that I have not tried is
utf8_decode
So this may be a solution for anyone facing this problem. It solved a problem with AJAX/PHP, as described in this blog post from 2009: Overcoming AJaX UTF-8 encoding restrictions (in PHP)
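For anyone who wants to try it: utf8_decode() converts a UTF-8 string to ISO-8859-1, so applied to double-encoded text it strips one layer of encoding. The function is deprecated as of PHP 8.2, so this sketch uses the equivalent mb_convert_encoding() call instead. The sample bytes are an assumption about what the damaged data looks like.

```php
<?php
// "é" after being UTF-8 encoded twice (it displays as "Ã©").
$double = "\xC3\x83\xC2\xA9";

// UTF-8 -> ISO-8859-1 removes one layer; this is the conversion utf8_decode() performs.
$fixed = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

echo bin2hex($fixed), "\n"; // c3a9
```

This only helps when the data really was encoded twice. Applied to clean UTF-8, it would corrupt any character that has no ISO-8859-1 equivalent.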