Objective-C: how to get a Unicode code point

I want to get the Unicode code point for a given Unicode character in Objective-C. The NSString documentation says it uses UTF-16 as its internal encoding:

The NSString class has two primitive methods, length and characterAtIndex:, that provide the basis for all other methods in its interface. The length method returns the total number of Unicode characters in the string. characterAtIndex: gives access to each character in the string by index, with index values starting at 0.

So characterAtIndex: appears to operate on Unicode characters. However, it returns unichar, which is a 16-bit unsigned integer type:

- (unichar)characterAtIndex:(NSUInteger)index 

Questions:

  • Q1: How does it represent Unicode code points above U+FFFF?

  • Q2: If Q1 makes sense, is there a way to get the Unicode code point for a given Unicode character in Objective-C?

Thanks.

+4
2 answers

A short answer to Q1 ("How does it represent Unicode code points above U+FFFF?"): you need to know UTF-16 and handle surrogate code points correctly. The information and links below should give you pointers and sample code for doing this.

The NSString documentation is correct. However, while you said that "NSString uses UTF-16 as its internal encoding," it is more accurate to say that the public/abstract interface of NSString is based on UTF-16. The difference is that this leaves the internal string representation a private implementation detail, but public methods such as characterAtIndex: and length are always expressed in UTF-16 code units.

The reason for this is that it strikes the best balance between ASCII-centric and Unicode strings, mainly because Unicode is a strict superset of ASCII (ASCII uses 7 bits, for 128 characters, which map to the first 128 Unicode code points).

To represent Unicode code points above U+FFFF, which clearly exceed what fits in a single UTF-16 code unit, UTF-16 uses special "surrogate code points" that form a "surrogate pair", which, when combined, represents a single Unicode code point above U+FFFF. You can find more information about this in the surrogate-related entries at http://unicode.org/glossary/ .

+3

From the length documentation:

The number returned includes the individual characters of composed character sequences, so you cannot use this method to determine whether a string will be visible when printed or how long it will appear.

From this, I would conclude that any character above U+FFFF will count as two characters (two unichar values) and will be encoded as a surrogate pair (see the corresponding entries at http://unicode.org/glossary/ ).

If you have a UTF-32 encoded string containing the character you want to convert, you can create a new NSString using initWithBytesNoCopy:length:encoding:freeWhenDone: and inspect the result to see how the character is encoded in UTF-16. But if you are going to do very heavy Unicode processing, your best bet is probably to familiarize yourself with ICU (http://site.icu-project.org/).

+2

Source: https://habr.com/ru/post/1336013/
