PHP DOMDocument nodeValue discards UTF-8 literal characters instead of encoded

Question


I had a problem similar to this question:

nodeValue from DomDocument returning weird characters in PHP

The root cause I found can be mimicked with mb_convert_encoding().

In my unit tests, this finally caught the problem:

    $test = mb_convert_encoding('é', "UTF-8");
    $this->assertTrue(mb_check_encoding($test, 'UTF-8'), 'data is UTF-8');
    $this->assertTrue($this->rw->checkEncoding($test, 'UTF-8'), 'data is UTF-8');
    $this->assertIdentical($test, html_entity_decode('&Atilde;&copy;', ENT_QUOTES, 'UTF-8'), 'values match');

The UTF-8 data appears to be getting encoded a second time, and the base code page of the system PHP is running on is most likely not UTF-8.
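The mismatch the test catches can be reproduced without depending on how the script file was saved. This is a minimal sketch, assuming the mbstring extension is loaded and assuming the bytes of é are misread as ISO-8859-1; the source encoding is passed explicitly here, which the original test does not do.

```php
<?php
// "é" written as explicit UTF-8 bytes, so the file's own encoding is irrelevant.
$utf8 = "\xC3\xA9";

// Assumption: the input is misread as ISO-8859-1 and converted to UTF-8 again.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The two-character string "Ã©" from the test, decoded as UTF-8.
$expected = html_entity_decode('&Atilde;&copy;', ENT_QUOTES, 'UTF-8');

var_dump($doubleEncoded === $expected);               // bool(true)
var_dump(mb_check_encoding($doubleEncoded, 'UTF-8')); // bool(true)
```

The double-encoded result is still valid UTF-8, which would explain why a validity check alone does not flag it.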

All the way through parsing (with the HTML5lib implementation, which outputs to a DOMDocument), the strings remain clean and UTF-8 friendly. Only when pulling the data out using

 $span->nodeValue

do I see the encoding break.

My hunch is that the export from DOMDocument to nodeValue runs something like htmlentities or an encoding converter, but ignores the encoding declared for the document.

Given that my problem involves HTML5, I thought it would be directly related to how new the implementation is, but it seems to be a broader problem. Searching turned up no information on this DOMDocument issue apart from the question mentioned at the beginning.

UPDATE

In the name of moving forward, I switched from HTML5lib and DOMDocument to Simple HTML DOM. It exports cleanly escaped HTML, which I can then decode back into correct UTF-8.

Also, one function that I have not tried was

 utf8_decode

So this may be a solution for anyone faced with this problem. It solved a problem I had with AJAX/PHP; the solution came from this 2009 blog post: Overcoming AJaX UTF-8 encoding restrictions (in PHP)


php encoding utf-8 character-encoding domdocument

Dave espionage Mar 03 '11 at 20:28


2 answers

Patrick · Answer 1 · 2012-05-03T09:44:13+0000

I just used utf8_decode on nodeValue and it really worked; I had a problem with special characters not displaying correctly.

However, some characters still remain problematic, such as a simple quote ' and a few others (for example, )

So using $element->nodeValue will not work, but utf8_decode($element->nodeValue) will, PARTIALLY.

troelskn · Answer 2 · 2012-05-03T09:57:26+0000

The utf8_decode and utf8_encode functions are not well named. They literally convert from UTF-8 to ISO-8859-1 and from ISO-8859-1 to UTF-8, respectively.
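The two conversions can be sketched with explicit encoding names. This is a minimal illustration, assuming the mbstring extension is available; mb_convert_encoding stands in for utf8_decode and utf8_encode here, and the choice of U+2019 as a character that does not survive is an assumption about the kind of quote that remained problematic in the previous answer.

```php
<?php
// Roughly what utf8_decode() does: UTF-8 in, ISO-8859-1 out.
$e = "\xC3\xA9";                       // é in UTF-8 (2 bytes)
$latin1 = mb_convert_encoding($e, 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($latin1));            // string(2) "e9"

// Roughly what utf8_encode() does: ISO-8859-1 in, UTF-8 out.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($roundTrip === $e);           // bool(true)

// U+2019 (right single quotation mark) has no ISO-8859-1 equivalent,
// so the conversion cannot preserve it.
$quote = "\xE2\x80\x99";
$lossy = mb_convert_encoding($quote, 'ISO-8859-1', 'UTF-8');
$back  = mb_convert_encoding($lossy, 'UTF-8', 'ISO-8859-1');
var_dump($back === $quote);            // bool(false)
```

Any character outside ISO-8859-1 is lost in the first step, which is one plausible reason a utf8_decode workaround only partially works.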

mb_convert_encoding, when called with only utf-8 as an argument, will usually behave like utf8_encode. (Usually, that is, unless you have changed the internal encoding, which you hopefully have not.)

Most PHP functions expect strings to be ISO-8859-1 encoded. However, libxml (the library underlying PHP's XML parsing extensions) expects strings to be UTF-8. So you can easily end up with garbled encodings if you are not careful.
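A small sketch of that boundary, assuming the dom and mbstring extensions are loaded; the sample document is made up for illustration. The input is declared as ISO-8859-1, yet the value read back through the DOM API is UTF-8.

```php
<?php
// An XML document whose bytes are ISO-8859-1, as declared in its prolog.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><span>caf\xE9</span>";

$doc = new DOMDocument();
$doc->loadXML($xml);

// The DOM API hands strings back as UTF-8, whatever the input encoding was.
$value = $doc->documentElement->nodeValue;
var_dump(bin2hex($value));   // string(10) "636166c3a9"

// Convert explicitly only if the rest of the code needs ISO-8859-1.
$latin1 = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($latin1));  // string(8) "636166e9"
```

Treating nodeValue as UTF-8 and converting once, at a known point, avoids the double conversion described in the question.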

As for your test, the first line can be deceiving. Since you have the literal é in your script, the result will vary depending on which encoding the file was saved in. Check this in your text editor.
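One way to take the editor out of the picture is to write the test input as byte escapes rather than as a literal. A minimal sketch, assuming mbstring is loaded; $literal is a hypothetical variable standing in for whatever string the script actually contains.

```php
<?php
// The same character in two encodings, written as byte escapes.
$utf8   = "\xC3\xA9";   // é as UTF-8
$latin1 = "\xE9";       // é as ISO-8859-1

var_dump(mb_check_encoding($utf8, 'UTF-8'));    // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8'));  // bool(false)

// For a literal typed into the script, bin2hex() shows which bytes were saved.
$literal = $utf8;       // hypothetical stand-in for a literal 'é'
var_dump(bin2hex($literal));                    // string(4) "c3a9"
```

If bin2hex on the literal prints e9 instead of c3a9, the file was saved as ISO-8859-1 (or a similar single-byte encoding) and the test input is not the UTF-8 string it appears to be.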

Hope this clarifies a bit.
