How do I decode this strange string to UTF-8? (PHP)

So, I have %u041E%u043B%u0435%u0433%20%u042F%u043A. How do I convert it to real UTF-8 (or, better for me, to HTML entities)?

+4
3 answers

This is the format produced by JavaScript's escape(). It resembles URL encoding but is not compatible with it. Using it is usually a mistake.

The best fix is to change the script that generates it so that it uses proper URL encoding ( encodeURIComponent() ) instead. You can then decode it with urldecode or any other ordinary server-side URL-decoding function.
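As a rough illustration of that round trip: assuming the client sends the question's text through encodeURIComponent(), which percent-encodes its UTF-8 bytes, the server side needs nothing beyond urldecode. The variable names below are made up for the example.

```php
<?php
// What encodeURIComponent("Олег Як") produces: percent-encoded UTF-8 bytes.
$encoded = '%D0%9E%D0%BB%D0%B5%D0%B3%20%D0%AF%D0%BA';

// urldecode() turns each %XX back into one byte, giving a UTF-8 string.
$decoded = urldecode($encoded);

echo $decoded, "\n"; // prints: Олег Як
```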

If you absolutely need to exchange data in this non-standard format, you will have to write your own decoder for it. Here's a quick hack using an HTML character decoder:

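    function jsunescape($s) {
        // %uXXXX (four hex digits) -> hex character reference
        $s = preg_replace('/%u([0-9A-Fa-f]{4})/', '&#x$1;', $s);
        // %XX (two hex digits) -> hex character reference
        $s = preg_replace('/%([0-9A-Fa-f]{2})/', '&#x$1;', $s);
        return html_entity_decode($s, ENT_COMPAT, 'utf-8');
    }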

This returns a raw UTF-8 byte string. If you really want HTML character references such as &#x0420;&#x0443;... (which render as Ру...), leave out the html_entity_decode call. Usually you don't. It is best to keep strings in raw form until they are escaped for final output, and best not to replace non-ASCII characters with character references at all unless you really need to.
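If the detour through character references feels too indirect, the same decoding can be sketched without html_entity_decode. This is an alternative technique, not the code from the answer. It assumes the mbstring extension is available, and js_unescape_utf8 is a hypothetical name. It treats two-digit escapes as code points below 256, which is how escape() emits them, and it converts runs of %uXXXX together so that surrogate pairs are joined.

```php
<?php
// Hypothetical helper: decode JavaScript escape() output into a UTF-8 string.
function js_unescape_utf8(string $s): string
{
    return preg_replace_callback(
        '/((?:%u[0-9A-Fa-f]{4})+)|%([0-9A-Fa-f]{2})/',
        function (array $m): string {
            if ($m[1] !== '') {
                // A run of %uXXXX escapes: pack the UTF-16 code units and
                // convert them in one go so surrogate pairs combine.
                $utf16 = '';
                foreach (explode('%u', substr($m[1], 2)) as $hex) {
                    $utf16 .= pack('n', hexdec($hex));
                }
                return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
            }
            // %XX from escape() is a code point below 256, i.e. ISO-8859-1.
            return mb_convert_encoding(chr(hexdec($m[2])), 'UTF-8', 'ISO-8859-1');
        },
        $s
    );
}

echo js_unescape_utf8('%u041E%u043B%u0435%u0433%20%u042F%u043A'), "\n"; // prints: Олег Як
```

Handling both escape forms in a single pass also means a decoded "%" can never be mistaken for the start of another escape.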

what if I have a string like "%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED"?

This format is form URL encoding, which is not the same as the escape() format. Although the two-digit byte escapes of URL encoding differ from the four-digit code-unit escapes of the crazy escape() format, the + character is ambiguous. It may mean a plus (if the string came from escape() ) or a space (if it came from a browser form submission). It is impossible to tell which. This is another reason not to use escape() .
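PHP's two standard URL decoders make the ambiguity concrete, since each one commits to a single reading of "+". The input value below is invented for the illustration.

```php
<?php
$input = 'a+b%2Bc'; // hypothetical value containing both "+" and "%2B"

// urldecode() applies form rules: "+" becomes a space.
$asForm = urldecode($input);    // "a b+c"

// rawurldecode() decodes only %XX and leaves "+" as a literal plus.
$asRaw = rawurldecode($input);  // "a+b+c"

var_dump($asForm, $asRaw);
```

Which of the two is right depends entirely on where the string came from, and the string itself does not say.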

Other than that, if this string's encoding were UTF-8, then yes, the function above would be fine, turning both the URL-encoded bytes and the crazy escape()-format Unicode characters into raw UTF-8 bytes.

In reality, however, it appears to be code page 1251 (Windows Russian). Are you sure you want to handle all your strings in cp1251? If so, you will have to tweak the function a bit so that it encodes the four-digit escapes into that encoding instead. It gets messy:

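    function url_or_maybe_jsescape_decode($s, $charset, $isform) {
        // In a form submission, "+" stands for a space
        if ($isform)
            $s = str_replace('+', ' ', $s);
        $s = preg_replace('/%u([0-9A-Fa-f]{4})/', '&#x$1;', $s);
        // Mark two-digit escapes with "&!" so the entity decoder skips them
        $s = preg_replace('/%([0-9A-Fa-f]{2})/', '&!#x$1;', $s);
        // Four-digit escapes become characters encoded in $charset
        $s = html_entity_decode($s, ENT_COMPAT, $charset);
        // Two-digit escapes are already bytes in $charset: emit them as-is
        $s = preg_replace_callback('/&!#x([0-9A-Fa-f]{2});/',
            function ($m) { return chr(hexdec($m[1])); }, $s);
        return $s;
    }

    // Outputs the decoded text as cp1251 bytes
    echo url_or_maybe_jsescape_decode('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED', 'cp1251', TRUE);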

I would highly recommend:

  • fixing the Flash file so that it uses the correct encodeURIComponent() rather than escape() , so you can use a standard URL decoder instead of this ugly hack.

  • using UTF-8 rather than cp1251 throughout your application, so you can support languages other than Russian and do not need to worry about the input encoding of form submissions changing.

(All encodings other than UTF-8 are inferior, and that is a FACT proven by SCIENCE!)
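While legacy cp1251 input still arrives, one way to move toward UTF-8 everywhere is to convert it once at the boundary. The sketch below assumes the mbstring extension and assumes the value really is form-encoded Windows-1251, as the string in the comment appears to be. The variable names are illustrative.

```php
<?php
// Form-encoded Windows-1251 bytes (the string from the comment above).
$raw = urldecode('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED');

// Convert once on the way in; everything after this point sees UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1251');

echo $utf8, "\n"; // prints: Олег Якушкин
```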

+9

PHP has a decoding function

    $string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
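A quick check of what this call does with the question's format may help. html_entity_decode only recognizes &...; references, so %uXXXX escapes pass through untouched until they are rewritten as references first. The snippet is a small experiment with made-up variable names, not part of the answer.

```php
<?php
$escaped = '%u041E%u043B';

// No "&...;" references in the input, so nothing is decoded.
$unchanged = html_entity_decode($escaped, ENT_COMPAT, 'UTF-8');

// After rewriting the escapes as hex references, the decoder handles them.
$asRefs  = preg_replace('/%u([0-9A-Fa-f]{4})/', '&#x$1;', $escaped);
$decoded = html_entity_decode($asRefs, ENT_COMPAT, 'UTF-8');

var_dump($unchanged === $escaped); // bool(true)
echo $decoded, "\n";               // prints: Ол
```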
+2

As others have suggested, convert it to Unicode HTML entities. This is the regex I use:

    function escapePercentU($s) {
        $s = preg_replace("/%u([A-Fa-f0-9]{4})/", "&#x$1;", $s);
        return html_entity_decode($s, ENT_COMPAT, 'utf-8');
    }
0

Source: https://habr.com/ru/post/1310148/

