How to decode such a strange string to UTF-8? (PHP)

Question

So, I have %u041E%u043B%u0435%u0433%20%u042F%u043A. How do I convert it to real UTF-8 (or, better for me, to HTML entities)?

+4

php utf-8 decode encode

Rella May 18, '10 at 18:41

3 answers

PHP has a decoding function:

 $string = html_entity_decode($string, ENT_COMPAT, "UTF-8");

+2

Geek num 88 May 18, '10 at 18:49

As suggested by others, convert it to Unicode HTML Entities. This is the regex that I use,

 function escapePercentU($s) {
     $s = preg_replace("/%u([A-Fa-f0-9]{4})/", "&#x$1;", $s);
     return html_entity_decode($s, ENT_COMPAT, 'utf-8');
 }

0

Zz coder May 18, '10 at 19:39

bobince · Accepted Answer · 2010-05-18T18:54:58+0000

This is the format produced by JavaScript's escape() function. It is similar to URL encoding, but not compatible with it. Using it is usually a mistake.

Your best bet is to change the script that generates the string so that it uses proper URL encoding ( encodeURIComponent() ) instead. You can then decode it with urldecode or any other normal server-side URL-decoding function.
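
As a rough illustration of that recommendation, the sketch below decodes the string that encodeURIComponent() would produce for the text in the question. The variable names are only illustrative; rawurldecode() and urldecode() are standard PHP functions.

```php
<?php
// What JavaScript's encodeURIComponent('Олег Як') produces:
// percent-encoded UTF-8 bytes rather than %uXXXX code units.
$encoded = '%D0%9E%D0%BB%D0%B5%D0%B3%20%D0%AF%D0%BA';

// rawurldecode() reverses percent-encoding and leaves '+' untouched.
$decoded = rawurldecode($encoded);

echo $decoded, "\n"; // the raw UTF-8 bytes of the original text
```

No custom decoder is needed in this case, because the bytes on the wire are already UTF-8.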

If you absolutely need to exchange data in this non-standard format, you will have to write your own decoder for it. Here's a quick hack using an HTML character decoder:

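 function jsunescape($s) {
     // %uXXXX is a UTF-16 code unit; rewrite it as a hex character reference.
     $s= preg_replace('/%u(....)/', '&#x$1;', $s);
     // %XX from escape() is a code point below U+0100; treat it the same way.
     $s= preg_replace('/%(..)/', '&#x$1;', $s);
     // Turn the character references into UTF-8 bytes.
     return html_entity_decode($s, ENT_COMPAT, 'utf-8');
 }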

This returns a raw UTF-8 byte string. If you really want HTML character references like &#x420;&#x443;... (which render as Ру...), leave out the html_entity_decode call. But usually that is not what you want. It is best to keep strings in raw form until they are escaped for final output, and best not to replace non-ASCII characters with character references at all unless you really need to.
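
For comparison, here is a minimal sketch of a decoder that skips the HTML-entity round trip. It is not the code from this answer: js_unescape_utf8 is a made-up name, and the approach (rewriting the escapes as JSON \uXXXX sequences and letting json_decode() assemble the UTF-16 code units) is just one assumption about how this could be done with core PHP functions only.

```php
<?php
// Hypothetical helper, not part of the answer above: decodes escape() output
// straight to UTF-8 without going through HTML character references.
function js_unescape_utf8(string $s): string
{
    // Match whole runs of escapes so that a surrogate pair such as
    // %uD83D%uDE00 is decoded together as one character.
    return preg_replace_callback(
        '/(?:%u[0-9A-Fa-f]{4}|%[0-9A-Fa-f]{2})+/',
        function (array $m): string {
            // Rewrite %uXXXX as \uXXXX and %XX as \u00XX, which is JSON's
            // notation for UTF-16 code units.
            $units = str_replace('%u', '\u', $m[0]);
            $units = str_replace('%', '\u00', $units);
            $decoded = json_decode('"' . $units . '"');
            // Leave the text unchanged if it is not valid UTF-16
            // (for example a lone surrogate).
            return is_string($decoded) ? $decoded : $m[0];
        },
        $s
    ) ?? $s;
}

echo js_unescape_utf8('%u041E%u043B%u0435%u0433%20%u042F%u043A'), "\n";
```

Unlike the '....' patterns in the quick hack, the stricter hex-digit patterns here ignore a stray % that is not followed by a valid escape.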

what if I have a string like "%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED"?

That is form URL encoding, which is not quite compatible with the escape() format. While the two-digit URL-encoded bytes can be told apart from the crazy escape()-format four-digit code-unit escapes, the + character is ambiguous. It could mean either a plus (if the string came from escape() ) or a space (if it came from a browser form submission). It is impossible to tell which. This is yet another reason not to use escape() .
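
The ambiguity is easy to see with PHP's two built-in decoders. This is only a small demonstration; the input string is invented for the example.

```php
<?php
$input = 'a+b%20c';

// urldecode() follows form-encoding rules: '+' is a space.
$asForm = urldecode($input);      // "a b c"

// rawurldecode() keeps '+' as a literal plus, which is how a '+'
// in escape() output has to be read.
$asRaw = rawurldecode($input);    // "a+b c"

var_dump($asForm === $asRaw);     // bool(false)
```

Which of the two is right depends entirely on where the string came from, which the string itself cannot tell you.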

Apart from that, if the encoding of this string were UTF-8, then yes, the above function would work fine, turning both URL-encoded bytes and crazy escape()-format Unicode characters into raw UTF-8 bytes.

However, it actually appears to be code page 1251 (Windows Russian). Are you sure you want to handle all your strings in cp1251? If so, you will have to modify the function a bit so that the four-digit escapes are encoded into that charset instead. This is messy:

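 function url_or_maybe_jsescape_decode($s, $charset, $isform) {
     // In a form submission '+' means a space; in escape() output it is a literal plus.
     if ($isform)
         $s= str_replace('+', ' ', $s);
     // Four-digit escapes become character references; two-digit byte escapes are
     // marked with '&!' so that the first decoding pass leaves them alone.
     $s= preg_replace('/%u(....)/', '&#x$1;', $s);
     $s= preg_replace('/%(..)/', '&!#x$1;', $s);
     // Encode the four-digit escapes into the target charset.
     $s= html_entity_decode($s, ENT_COMPAT, $charset);
     // Unmark the byte escapes and decode them as ISO-8859-1, so that each %XX
     // becomes exactly one raw byte, which is already in the target charset.
     $s= str_replace('&!', '&', $s);
     $s= html_entity_decode($s, ENT_COMPAT, 'ISO-8859-1');
     return $s;
 }

 // Outputs the decoded string as cp1251 bytes.
 echo url_or_maybe_jsescape_decode('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED', 'cp1251', TRUE);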

I would highly recommend:

fixing the Flash file so that it uses the correct encodeURIComponent rather than escape , so you can use the standard URL decoder instead of this ugly hack.
using UTF-8 rather than cp1251 throughout your application, so you can support languages other than Russian, and you do not need to worry about the input encoding of submitted forms changing.

(All non-UTF-8 encodings are bad, and that is a FACT proved by SCIENCE!)
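
If the application does move to UTF-8, legacy cp1251 form data such as the example string can be converted once, at the input boundary. The following is a minimal sketch, assuming the iconv extension is enabled; form_decode_to_utf8 is a made-up name, not an existing function.

```php
<?php
// Hypothetical helper: decodes form-URL-encoded input whose bytes are in a
// legacy charset and returns UTF-8. Assumes the iconv extension is available.
function form_decode_to_utf8(string $s, string $charset): string
{
    // urldecode() turns '+' into a space and %XX into raw bytes.
    $bytes = urldecode($s);
    $utf8 = iconv($charset, 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new InvalidArgumentException("Input is not valid $charset");
    }
    return $utf8;
}

echo form_decode_to_utf8('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED', 'CP1251'), "\n";
```

This only covers genuine form submissions. It does not attempt to handle %uXXXX escapes, so it is not a replacement for fixing the sender.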
