html_entity_decode - character encoding problem

Question

I'm having problems with character encoding. I've simplified the problem to the script below:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
    <?php
    $string = 'Stan&#146;s';
    echo $string.'<br><br>';                                // Stan's
    echo html_entity_decode($string).'<br><br>';            // Stan's
    echo html_entity_decode($string, ENT_QUOTES, 'UTF-8');  // Stans
    ?>
    </body>
    </html>

I would like to use the last echo. However, it removes the apostrophe. Why?

Update

I tried all three options, ENT_COMPAT, ENT_QUOTES and ENT_NOQUOTES, and the apostrophe is removed in all cases.

php

Abs Aug 21 '11 at 11:30

1 answer

deceze · Accepted Answer · 2011-08-21T11:44:03+0000

The problem is that &#146; decodes to the Unicode character U+0092 (UTF-8: C2 92), known as PRIVATE USE TWO:

    $ php test.php | xxd
    0000000: 5374 616e c292 73                        Stan..s

I.e., it does not decode to a regular apostrophe.
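To make the byte-level difference concrete, here is a small sketch that compares the UTF-8 bytes of U+0092 with those of the typographic apostrophe U+2019. It assumes PHP 7.0 or later for the "\u{...}" escape syntax, and the variable names are only illustrative. It does not call html_entity_decode() at all, so it only demonstrates the encodings, not the decoder's behaviour.

```php
<?php
// U+0092 (PRIVATE USE TWO): the code point that &#146; literally names.
$pu2 = "\u{0092}";
// U+2019 (RIGHT SINGLE QUOTATION MARK): what byte 0x92 means in Windows-1252.
$apostrophe = "\u{2019}";

echo bin2hex('Stan' . $pu2 . 's'), "\n";        // 5374616ec29273
echo bin2hex('Stan' . $apostrophe . 's'), "\n"; // 5374616ee2809973
```

The first line matches the xxd dump above (c2 92 between "Stan" and "s"). The second is what the asker presumably wanted to end up with.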

html_entity_decode($string) appears to work because it does not actually decode the entity: the default target encoding (ISO-8859-1, i.e. Latin-1, in PHP versions before 5.4) cannot represent this character, so the entity is passed through unchanged and the browser interprets it. If you specify UTF-8 as the target encoding, the entity is actually decoded.

The entity is intended to be read as a Windows-1252 byte:

    echo iconv('cp1252', 'UTF-8', html_entity_decode('Stan&#146;s', ENT_QUOTES, 'cp1252'));
    // Stan's

Quoting Wikipedia:

Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00-08, 0B-0C, 0E-1F, 7F, and 80-9F cannot be used in an HTML document, not even by reference, so &#153;, for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80-9F range are interpreted by some browsers as representing the characters mapped to bytes 80-9F in the Windows-1252 encoding.

So you are dealing with obsolete HTML entities here, which PHP apparently does not handle the way "some" browsers do. You could check whether the decoded entities fall in the range above, re-encode them as Windows-1252, and then convert them to UTF-8. Or ask your users to submit valid HTML.
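As an illustration of the range check, the forbidden ranges from the quote can be written as a pair of predicates. This is only a sketch: both function names are made up for this example and are not part of PHP or of the answer's code, and the scalar type declarations assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper: true when a numeric character reference to $codePoint
// falls in one of the ranges the quote lists as not allowed in HTML.
function is_disallowed_reference(int $codePoint): bool
{
    return ($codePoint >= 0x00 && $codePoint <= 0x08)
        || ($codePoint >= 0x0B && $codePoint <= 0x0C)
        || ($codePoint >= 0x0E && $codePoint <= 0x1F)
        || $codePoint === 0x7F
        || ($codePoint >= 0x80 && $codePoint <= 0x9F);
}

// Hypothetical helper: true only for the 80-9F subset, which some browsers
// reinterpret as Windows-1252 bytes.
function is_legacy_cp1252_reference(int $codePoint): bool
{
    return $codePoint >= 0x80 && $codePoint <= 0x9F;
}

var_dump(is_disallowed_reference(146));    // bool(true), 146 is 0x92
var_dump(is_legacy_cp1252_reference(146)); // bool(true)
var_dump(is_disallowed_reference(0x0A));   // bool(false), line feed is allowed
var_dump(is_disallowed_reference(8217));   // bool(false), U+2019 is a normal character
```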

This function should handle both obsolete and regular HTML entities:

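    function legacy_html_entity_decode($str, $quotes = ENT_QUOTES, $charset = 'UTF-8') {
        // Only decimal numeric references (&#NNN;) are matched here.
        return preg_replace_callback('/&#(\d+);/', function ($m) use ($quotes, $charset) {
            // References in 0x80-0x9F: decode as a Windows-1252 byte,
            // then convert that byte to the target charset.
            if (0x80 <= $m[1] && $m[1] <= 0x9F) {
                return iconv('cp1252', $charset, html_entity_decode($m[0], $quotes, 'cp1252'));
            }
            // Everything else is decoded normally.
            return html_entity_decode($m[0], $quotes, $charset);
        }, $str);
    }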
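A variant of the same idea, sketched with an explicit lookup table instead of iconv(), may be useful when you would rather not depend on how html_entity_decode() treats the 'cp1252' charset. The constant and function names below are hypothetical, the output is always UTF-8, and the code assumes PHP 7.0 or later. The table is the standard Windows-1252 assignment for bytes 0x80-0x9F; the five unassigned bytes are left out, so references to them stay untouched.

```php
<?php
// Windows-1252 meanings of bytes 0x80-0x9F, as UTF-8 strings.
// The unassigned bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are deliberately absent.
const CP1252_C1_MAP = [
    0x80 => "\u{20AC}", 0x82 => "\u{201A}", 0x83 => "\u{0192}", 0x84 => "\u{201E}",
    0x85 => "\u{2026}", 0x86 => "\u{2020}", 0x87 => "\u{2021}", 0x88 => "\u{02C6}",
    0x89 => "\u{2030}", 0x8A => "\u{0160}", 0x8B => "\u{2039}", 0x8C => "\u{0152}",
    0x8E => "\u{017D}", 0x91 => "\u{2018}", 0x92 => "\u{2019}", 0x93 => "\u{201C}",
    0x94 => "\u{201D}", 0x95 => "\u{2022}", 0x96 => "\u{2013}", 0x97 => "\u{2014}",
    0x98 => "\u{02DC}", 0x99 => "\u{2122}", 0x9A => "\u{0161}", 0x9B => "\u{203A}",
    0x9C => "\u{0153}", 0x9E => "\u{017E}", 0x9F => "\u{0178}",
];

// Hypothetical helper: rewrites only decimal references found in the table
// above and leaves every other entity for a regular decoder.
function replace_legacy_references(string $str): string
{
    $result = preg_replace_callback('/&#(\d+);/', function (array $m): string {
        $code = (int) $m[1];
        return CP1252_C1_MAP[$code] ?? $m[0];
    }, $str);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $str;
}

echo bin2hex(replace_legacy_references('Stan&#146;s')), "\n"; // 5374616ee2809973
```

The result can then be passed through html_entity_decode() with 'UTF-8' as the charset to decode whatever entities remain.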
