Escaping Unicode Characters Using C/C++

I need to escape Unicode characters within an input string as either UTF-16 or UTF-32 escape sequences. For example, the input string literal "Eat, drink, 愛" should be escaped as "Eat, drink, \u611b". Here are the rules, in table form:

Escape | Unicode Code Point

'\u' HEX HEX HEX HEX | A Unicode code point in the range U+0 through U+FFFF, inclusive, corresponding to the encoded hexadecimal value.

'\U' HEX HEX HEX HEX HEX HEX HEX HEX | A Unicode code point in the range U+0 through U+10FFFF, inclusive, corresponding to the encoded hexadecimal value.


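Detecting Unicode characters in general is simple, since the second byte is 0 if the character is ASCII:

L"a" = 97, 0

This will not be escaped. With other Unicode characters the second byte is not 0:

L"愛" = 27, 97

This will be escaped as \u611b. But how do I detect a UTF-32 string, which has to be escaped differently from UTF-16, with 8 hex digits?

Checking the size of the string is not enough, because UTF-16 strings also span multiple bytes, for example:

L"प्रे" = 42, 9, 77, 9, 48, 9, 71, 9

My task is to take unescaped input string literals such as Eat, drink, 愛 and store them in their escaped form, Eat, drink, \u611b (UTF-16 example). If the program finds a UTF-32 character, it should escape that too, in the form \U8902611b (UTF-32 example), but I cannot find a reliable way to tell whether the input is UTF-16 or UTF-32. So how can I reliably distinguish UTF-32 from UTF-16 characters within a wchar_t string or byte array?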


Your question contains several questions; the most important ones are answered below.

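Q. I have a C++ string such as "Eat, drink, 愛". Is it a UTF-8, UTF-16 or UTF-32 string?
A. That is implementation-defined. In many implementations it will be a UTF-8 string, but the standard does not require this. Consult your compiler's documentation.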

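Q. I have a wide C++ string such as L"Eat, drink, 愛". Is it a UTF-8, UTF-16 or UTF-32 string?
A. That is implementation-defined. In many implementations it will be a UTF-32 string. In some others it will be a UTF-16 string. The standard requires neither. Consult your compiler's documentation.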
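As a rough illustration of "implementation-defined", the width of wchar_t can be inspected at compile time. This is only a heuristic sketch: the helper names below are made up for this example, and a 16-bit or 32-bit wchar_t suggests UTF-16 or UTF-32 but the standard does not guarantee either.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: number of bits in one wchar_t code unit.
constexpr std::size_t wide_unit_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Heuristic only (an assumption, not a guarantee of the standard):
// a 16-bit wchar_t usually means wide strings are UTF-16,
// a 32-bit wchar_t usually means wide strings are UTF-32.
constexpr bool wide_is_probably_utf16() { return wide_unit_bits() == 16; }
constexpr bool wide_is_probably_utf32() { return wide_unit_bits() == 32; }
```

A program could branch on wide_is_probably_utf16() to decide whether surrogate pairs need to be decoded before escaping.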

Q. How can I write portable UTF-8, UTF-16 or UTF-32 string literals in C++?
A. C++11 provides a way:

u8"I'm a UTF-8 string."
u"I'm a UTF-16 string."
U"I'm a UTF-32 string."

C++03 has no equivalent.
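To see how the three literal kinds differ, one can count code units at compile time. The sketch below assumes a C++11 or later compiler; code_units is a helper invented for this example, and the character 愛 is written as \u611b so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// "Eat, drink, " is 12 ASCII characters, followed by U+611B.
constexpr std::size_t utf8_units  = code_units(u8"Eat, drink, \u611b"); // 12 + 3 bytes
constexpr std::size_t utf16_units = code_units(u"Eat, drink, \u611b");  // 12 + 1 units
constexpr std::size_t utf32_units = code_units(U"Eat, drink, \u611b");  // 12 + 1 units

// A code point above U+FFFF needs two UTF-16 units but one UTF-32 unit.
constexpr std::size_t astral_utf16 = code_units(u"\U0001F600");
constexpr std::size_t astral_utf32 = code_units(U"\U0001F600");
```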

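Q. Does the string "Eat, drink, 愛" contain a UTF-32 character?
A. No. There is no such thing as a UTF-32 (or UTF-16, or UTF-8) character. There are UTF-32 and other encoded strings, and they all contain Unicode characters.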

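Q. What exactly is a Unicode character?
A. It is an element of the coded character set defined by the Unicode standard. In a C++ program it can be represented in various ways; the simplest is a single 32-bit integer value holding the character's code point. (For simplicity, I ignore composite characters here and treat "character" and "code point" as the same thing, unless stated otherwise.)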

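Q. Given a Unicode character, how do I escape it?
A. Examine its value. If it is between 256 and 65535, print a 2-byte (4 hex digits) \u escape sequence. If it is greater than 65535, print a \U escape sequence; the value itself needs at most 6 hex digits, but the table in the question requires 8, so pad with leading zeros. Otherwise, print it as you normally would.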
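A minimal sketch of that rule, assuming the code point is already available as a 32-bit integer. The function name append_escaped is invented for this example, and the threshold of 256 follows the answer above (values 128 to 255 are appended as a single raw byte; use 128 instead to escape everything outside ASCII).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: append one code point to `out`, escaped per the
// table in the question (\u + 4 hex digits, or \U + 8 hex digits).
void append_escaped(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);               // outside this range is not Unicode
    char buf[11];                         // "\U" + 8 hex digits + null
    if (cp < 256) {
        out += static_cast<char>(cp);     // printed as-is, one byte
    } else if (cp <= 0xFFFF) {
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
        out += buf;
    } else {
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
        out += buf;
    }
}
```

Calling it for 'E', then U+611B, then U+1F600 builds the text E\u611b\U0001f600.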

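Q. Given a UTF-32 encoded string, how do I decompose it into characters?
A. Each element of the string (called a code unit) corresponds to a single character (code point). Just take them one at a time. Nothing special needs to be done.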

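Q. Given a UTF-16 encoded string, how do I decompose it into characters?
A. Values (code units) outside the range 0xD800 to 0xDFFF correspond to the Unicode characters with the same value. For each such value, print either a normal character or a 2-byte (4 hex digits) escape sequence. Values in the range 0xD800 to 0xDFFF come in pairs, and each pair represents a single character (code point) in the range U+10000 to U+10FFFF. For such a pair, print a \U escape sequence. To convert a pair (v1, v2) to its character value, use this formula:

c = 0x10000 + ((v1 - 0xd800) << 10) + (v2 - 0xdc00)

Note that the first element of the pair must be in the range 0xd800..0xdbff and the second in 0xdc00..0xdfff; otherwise the pair is ill-formed.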
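The decoding steps above could be sketched as follows. The function name decode_utf16 and the choice to throw on an ill-formed pair are assumptions made for this example; a real program might prefer to substitute U+FFFD instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-16 string into code points.
// Throws std::invalid_argument on an unpaired or misordered surrogate.
std::vector<std::uint32_t> decode_utf16(const std::u16string& s) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t v1 = s[i];
        if (v1 < 0xD800 || v1 > 0xDFFF) {
            out.push_back(v1);            // not a surrogate: value is the code point
            continue;
        }
        // v1 must be a high surrogate and must be followed by a low surrogate.
        if (v1 > 0xDBFF || i + 1 == s.size())
            throw std::invalid_argument("ill-formed UTF-16: unpaired surrogate");
        std::uint32_t v2 = s[++i];
        if (v2 < 0xDC00 || v2 > 0xDFFF)
            throw std::invalid_argument("ill-formed UTF-16: unpaired surrogate");
        out.push_back(0x10000 + ((v1 - 0xD800) << 10) + (v2 - 0xDC00));
    }
    return out;
}
```

For example, the pair (0xD83D, 0xDE00) decodes to 0x10000 + (0x3D << 10) + 0x200 = 0x1F600.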

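Q. Given a UTF-8 encoded string, how do I decompose it into characters?
A. The UTF-8 encoding is somewhat more complicated than UTF-16, and I will not describe it in detail here. There are many descriptions and sample implementations available; look them up.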
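For orientation only, here is a simplified sketch of a UTF-8 decoder. The name decode_utf8 is invented for this example. It checks lead and continuation bytes but, as a deliberate simplification, does not reject overlong encodings, encoded surrogates, or values above U+10FFFF, so it should not be treated as a validating decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into code points.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t extra;                // number of continuation bytes
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (extra > s.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // each continuation byte adds 6 bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

The character 愛 (U+611B) is the three bytes E6 84 9B in UTF-8, which this function turns back into the single value 0x611B.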

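Q. What about my L"प्रे" string?
A. It is a composite character made up of four Unicode code points: U+092A, U+094D, U+0930, U+0947. Note that this is not the same as a high code point represented by a surrogate pair, as described in the UTF-16 part of this answer. It is a case where "character" is not the same as "code point". Escape each code point separately. At this level of abstraction you are dealing with code points, not actual characters. Characters matter when, for example, you display text to a user or compute a position in printed text, but not when you deal with string encodings.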


Source: https://habr.com/ru/post/1541864/

