UTF-8 file names in PHP and different Unicode encodings

I have a file containing Unicode characters on a server running Linux. If I SSH into the server and use tab completion to navigate to a file or folder whose name contains Unicode characters, I have no problem accessing it. The problem occurs when I try to access the file through PHP (the function I used to access the file system was stat). If I output the path generated by the PHP script to the browser and paste it into the terminal, the file also seems not to exist, even though the two paths look exactly the same in the terminal.

I set PHP to use UTF-8 as its default encoding through php.ini, and I also set mb_internal_encoding. I checked the encoding of the file path string in PHP and it comes out as UTF-8, as you would expect. After poking around a little more, I decided to hexdump the é character produced by the terminal's tab completion and compare it with the hexdump of the "regular" é character created by the PHP script or typed manually on the keyboard (option+e, then e, on OS X). Here is the result:

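# é as produced by tab completion: bytes 65 cc 81 (hexdump prints 16-bit little-endian words)
echo -n é | hexdump
0000000 cc65 0081
0000003
# é as typed on the keyboard or generated by PHP: bytes c3 a9
echo -n é | hexdump
0000000 a9c3
0000002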

The é character that correctly refers to the file in the terminal is 3 bytes long. I'm not sure where to go from here: what encoding should I use in PHP? Should I convert the path to another encoding with iconv or mb_convert_encoding?

+3
3 answers

Thanks to the tips in the two answers, I was able to poke around and find some methods for normalizing the different Unicode decompositions of a given character. In my situation, I was accessing files created by an OS X Carbon application. It is a fairly popular application, and its file names seem to adhere to one specific Unicode decomposition.

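PHP 5.3 introduced a set of functions (the Normalizer class in the intl extension) that normalize a Unicode string to a particular form. Apparently there are four normalization forms a Unicode string can be converted to. Python has had Unicode normalization since version 2.3 through unicodedata.normalize. Reading up on how Python handles Unicode strings also helped me understand encoding and string handling a bit better.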

Here is a quick example of normalizing a Unicode file path:

import unicodedata
filePath = unicodedata.normalize('NFD', filePath)  # filePath must be a Unicode string, not bytes

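I found that the NFD form worked for all my purposes; I wonder whether it is the standard decomposition for Unicode file names.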
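Since it is not always known in advance which form a given file name uses, here is a minimal Python 3 sketch that tries the path as given and then its NFC and NFD forms. The helper name `find_existing_path` is hypothetical (it is not part of any library), and the sketch assumes the only mismatch is the normalization form.

```python
import os
import unicodedata


def find_existing_path(path):
    """Return the first normalization variant of `path` that exists, or None.

    Hypothetical helper: tries the path as given, then its NFC and NFD forms.
    """
    candidates = [
        path,
        unicodedata.normalize("NFC", path),  # precomposed form
        unicodedata.normalize("NFD", path),  # decomposed form
    ]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None
```

This only helps when the whole path uses a single form; a path that mixes forms across its components would need each component checked separately.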

+4

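The three-byte sequence is the UTF-8 representation of an e (0x65) followed by a combining acute accent (0xcc 0x81), while 0xc3 0xa9 is the "precomposed" é.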
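Even though UTF-8 is the default on Linux systems as well, the server (or, say, PHP) and your Mac can still produce different byte sequences for the same character.
For setting a system up for UTF-8, the Gentoo documentation on UTF-8 is a good reference.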
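To see the two byte sequences from the question side by side, here is a short Python 3 sketch. The variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings render the same but encode to different UTF-8 bytes.
print(precomposed.encode("utf-8").hex())  # c3a9
print(decomposed.encode("utf-8").hex())   # 65cc81

# Normalization converts between the two forms.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

These are the same bytes as in the hexdump output in the question, where they appear swapped in pairs because hexdump prints 16-bit little-endian words by default.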

+3

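First: on Linux, a file name is a sequence of bytes, not characters. So the path that PHP passes to the file system has to match the stored name byte for byte, however similar the two look on screen.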

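There is (at least) more than one way to represent é in UTF-8: as a single precomposed code point, or as an e followed by a combining accent. Unicode treats the two as equivalent and defines "canonicalisation" (normalization) to convert between them, but the resulting byte sequences are different.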

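Linux does not normalize file names for you: the kernel (and, typically, the file system) stores and compares names as raw bytes, so two names that are canonically equivalent still refer to two different files. The application has to normalize the path to whatever form is actually used on disk.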
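Because names are compared as bytes, one way to find a file without knowing its on-disk form is to list the directory and compare the names after normalizing both sides. Here is a minimal Python 3 sketch of that idea. The function name `resolve_name` is hypothetical, and it assumes the file system encoding is UTF-8.

```python
import os
import unicodedata


def resolve_name(directory, wanted):
    """Return the on-disk entry in `directory` that is canonically
    equivalent to `wanted`, or None if there is no such entry."""
    target = unicodedata.normalize("NFC", wanted)
    for entry in os.listdir(directory):
        # Compare both names in the same normalization form (NFC here).
        if unicodedata.normalize("NFC", entry) == target:
            return entry
    return None
```

The returned name is the exact one stored on disk, so it can be joined with `directory` and passed to `os.stat` or `open` as is.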

+1

Source: https://habr.com/ru/post/1711955/

