How exactly does the program convert everything to UTF-8 internally?

  • uses setlocale ()?
  • does it accept utf-8 for all input strings when in the UTF-8 locale?
  • I understand what unicode is and how it is associated with utf-8, but how to "convert it to it" inside with all their lines?

How to convert all input strings to UTF-8? Does it use the library function C?

Should the current working locale language be UTF-8?

UPDATE : if your answer might have specific technical data, that would be great as it was more like what I'm looking for. I already understand the reasons for using UTF-8 internally and why it simplifies working with multiple locales.

UPDATE : the answer mentioned just for using iconv and / or ICU, however, how strcmp()does it know with all other programs to compare them as UTF-8? Is it being implemented setlocale()? Or does it not matter?

+3
source share
5 answers

It's a little difficult to say where to start, as there are a lot of assumptions in the game.

In C, as we know and love it, there is a <char 'data type. In all commonly used implementations, this data type contains 8-bit bytes.

In a language, unlike any library functions that you use, these things are two integers. They have no "symbolic" semantics.

"str" "is" (, strcmp, isalnum), .

C , , Unicode. . - . . if (charvalue == 'a'). - .

.

UTF-8? , . Unicode ( 32- ) , . , .

, , , , , , .

32- (UTF-32), , , . , . , gcc wchar_t 32- , Microsoft Visual Studio , , 16- (UTF-16 UCS-2, ).

, Windows C, 8- . , Unicode UTF-8, Unicode 8- . UTF-8 1 4 . ISO-646 ('ascii') , , .

UTF-8, UTF-8, lib . UTF-8, , , ICU ICONV.

, . . open(2) Windows, . UTF-8, , UTF-8.

fopen(3), , . , , Big5, , fopen, , , . ICONV ICU UTF-8 .

" ". . UTF-8 argv UTF-8. 0 UTF-8. UTF-8, setlocale UTF-8, UTF-8 argv. , , , , .

+5

-... , , libiconv ICU, ... ...

EDIT:

C, C. UTF-8 , glib ICU.

+2

- . .

, " UTF-8 " , UTF-8 , , UTF-8, , UTF-8. , .

, , , UTF-8 (.. ).

+1

ICU utf-16 ( ), utf-8. , , untailored UCA, "".

+1

C, . strcmp() - memcmp() ( ), 0. C strcmp . (CP850, UTF-8, Ansi, Windows, Mac), , , .

, , strcmp(), , , .

XML, libxml, () , XML-.

encoding / character tables is one of the worst concepts in C, starting from the old days when 7 bytes character bytes and the computer world occurred in the USA. (therefore there are no umlauts, accents, EURO-Sign, etc.)

0
source

Source: https://habr.com/ru/post/1744490/


All Articles