Decoding numeric HTML entities with PHP

Question

I have this code to decode numeric HTML entities to the equivalent UTF-8 characters.

I am trying to convert this character:

&#146;

which should be output as: ’

However, it just disappears (no output). I checked the source of the page, and it has the correct headers / meta tags for UTF-8.

Does anyone know what is wrong with the code?

    function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8")
    {
        $string = html_entity_decode($string, $quote_style, $charset);
        $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
        $string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\\1")', $string);

        // this is another method, which also doesn't work..
        // $string = preg_replace_callback("/(\&#[0-9]+;)/", "entity_decode_callback", $string);

        return $string;
    }

    function chr_utf8_callback($matches)
    {
        return chr_utf8(hexdec($matches[1]));
    }

    function chr_utf8($num)
    {
        if ($num < 128) return chr($num);
        if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
        if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
        if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
        return '';
    }

    function entity_decode_callback($m)
    {
        return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
    }

    echo '=' . entity_decode('&#146;');

html php utf-8 character-encoding

Wesley Mar 6 '12 at 16:26

1 answer

hakre · Accepted Answer · 2012-03-06T16:30:39+0000

html_entity_decode already does what you are looking for:

    $string = '&#146;';
    echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');

It returns the character U+0092, whose UTF-8 bytes in hex are:

    c292

This is PRIVATE USE TWO (U+0092), a C1 control character with no visible glyph. Because of that, depending on your PHP configuration / version / build, it may not be returned at all.
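To see where c292 comes from, here is a minimal sketch that applies the two-byte branch of the question's chr_utf8() to code point 0x92. The function name utf8_two_byte is made up for this illustration; the bit arithmetic is the same as in the question.

```php
<?php
// Hypothetical helper: UTF-8 encoding for code points 0x80..0x7FF only
// (the two-byte range), using the same arithmetic as chr_utf8() above.
function utf8_two_byte(int $cp): string
{
    // Lead byte: 110xxxxx carries the top bits of the code point.
    // Continuation byte: 10xxxxxx carries the low six bits.
    return chr(($cp >> 6) + 192) . chr(($cp & 63) + 128);
}

// 0x92 >> 6 = 2,  2 + 192 = 0xC2
// 0x92 & 63 = 18, 18 + 128 = 0x92
echo bin2hex(utf8_two_byte(0x92)), "\n"; // c292
```

So the decoding itself works; the result is just an invisible control character rather than the quotation mark you expected.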

There are also a few other quirks:

But in HTML (other than XHTML, which uses XML rules), it is a long-standing browser quirk that character references in the range &#128; to &#159; are taken to mean the characters associated with bytes 128 through 159 in the Windows code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behavior.

See: &#146; converted to "\u0092" by nokogiri in Ruby on Rails
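If you want the browser-style result (&#146; becoming ’), one possible approach is to remap numeric references in the 128 to 159 range to their cp1252 code points before calling html_entity_decode. The sketch below is only an illustration: the names CP1252_C1_MAP and decode_entities_cp1252 are invented here, and it assumes PHP 7 or later.

```php
<?php
// Unicode code points for Windows-1252 bytes 0x80..0x9F.
// Bytes that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are omitted.
const CP1252_C1_MAP = [
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
];

// Hypothetical helper: rewrite numeric references that fall in the cp1252
// range, then let html_entity_decode do the actual decoding.
function decode_entities_cp1252(string $html): string
{
    $remapped = preg_replace_callback(
        '/&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));/i',
        function (array $m): string {
            // Group 1 is the hex form (&#x92;), group 2 the decimal form (&#146;).
            $code = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
            // Swap in the cp1252 code point if there is one, else keep the original.
            $code = CP1252_C1_MAP[$code] ?? $code;
            return '&#' . $code . ';';
        },
        $html
    );

    return html_entity_decode($remapped, ENT_QUOTES, 'UTF-8');
}

echo bin2hex(decode_entities_cp1252('&#146;')), "\n"; // e28099, i.e. U+2019
```

All other entities pass through to html_entity_decode unchanged, so named entities and ordinary numeric references behave as before.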
