UTF-8 support in cross-platform C

I am developing a cross-platform C application (C89) that has to deal with UTF-8 text. All I need are basic string manipulation functions such as substr , first , last , and so on.

Question 1

Is there a UTF-8 library that implements the above functions? I have already looked at ICU, and it is too large for my requirements. I just need to support UTF-8.

I found a UTF-8 decoder here . The following function prototypes are from that code.

 void utf8_decode_init(char p[], int length);
 int utf8_decode_next();

The initialization function accepts an array of characters, but utf8_decode_next() returns an int . Why is that? How can I print the characters returned by this function using standard functions like printf ? The function deals with character data, so how can its result be assigned to an integer?

If the aforementioned decoder is not suitable for production code, do you have any recommendations?

Question 2

I was also confused by articles saying that you need to use wchar_t for Unicode. In my opinion, this is not required, since regular C strings can hold UTF-8 values. I verified this by looking at the source code of SQLite and Git. SQLite has the following typedef.

 typedef unsigned char u8; 

Is my understanding correct? And is an unsigned char also required?

+4
6 answers
  • The utf8_decode_next() function returns the next Unicode code point. Since Unicode is a 21-bit character set, the function cannot return anything smaller than an int , and one could argue that technically it should be a long , since an int may be only 16 bits. Effectively, the function returns a UTF-32 character.

    You will need to look at the C94 wide-character extensions to C89 to print wide characters ( wprintf() , <wctype.h> , <wchar.h> ). However, wide characters are not guaranteed to be UTF-8 or even Unicode. You most likely cannot print the characters from utf8_decode_next() portably, but it depends on your portability requirements. The wider the range of systems you must port to, the less likely it is that everything will just work. To the extent that you can print UTF-8 portably, you should pass a UTF-8 string (and not an array of UTF-32 characters obtained from utf8_decode_next() ) to one of the regular printing functions. One of the strengths of UTF-8 is that it can be manipulated by code that is largely unaware of it.

  • You need to understand that a 4-byte wchar_t can hold any Unicode code point in a single unit, whereas UTF-8 may take from one to four 8-bit bytes (1-4 storage units) to hold a single code point. On some systems, wchar_t is a 16-bit ( short ) integer; there you are effectively forced to use UTF-16, which encodes code points outside the Basic Multilingual Plane (BMP, code points U+0000..U+FFFF) using two storage units, called surrogates.

    Using unsigned char makes life easier; plain char is often signed. Having negative values around makes life harder than it needs to be (and, believe me, it is hard enough without the extra complexity).

+4

You do not need special library routines to search for characters or substrings in UTF-8. strstr does everything you need. That is the whole point of UTF-8, and a design requirement it was invented to meet.

+4

GLib has many related functions and can be used independently of GTK+.

+2

Unicode has over 100,000 characters. Most C implementations have only 256 possible char values.

Therefore, UTF-8 uses more than one char to encode each character, and the decoder needs a return type wider than char .

wchar_t is a larger type than char (well, it doesn't have to be larger, but it usually is). It represents the characters of an implementation-defined character set. On some implementations (most notably Windows, which uses surrogate pairs for characters outside the Basic Multilingual Plane), it is still not large enough to represent every Unicode character, which is presumably why the decoder you are using returns an int .

You cannot print wide characters with printf , because it deals with char . wprintf deals with wchar_t , so if the wide character set is Unicode, and if wchar_t is int on your system (as on Linux), then wprintf and friends will print the decoder's output without further processing. Otherwise it won't work.

In any case, you cannot portably output arbitrary Unicode characters, since there is no guarantee that the terminal can display them, or even that the wide character set is related to Unicode at all.

SQLite probably used unsigned char so that:

  • they know the signedness: it is implementation-defined whether plain char is signed or not.
  • they can do right shifts and assign out-of-range values and get consistent, well-defined results on all C implementations. Implementations have more latitude in how a signed char behaves than an unsigned char .
+1

Ordinary C strings are fine for storing UTF-8 data, but you cannot easily search your UTF-8 string by character position. That is because a character encoded with UTF-8 may occupy from one to four bytes, depending on the character. In other words, a "character" is not equivalent to a "byte" in UTF-8 the way it is in ASCII.

To search by character position and the like, you will need to decode the text into some internal format used to represent Unicode characters, and then search in that. Since there are far more than 256 Unicode characters, a byte (or char ) is not enough. That is why the library you found uses ints.

As for your second question, it is probably just that it makes no sense to talk about negative characters, so they may as well be declared unsigned.

0

I implemented substr and length functions that support UTF-8 characters. The code is a modified version of what SQLite uses.

The following macro traverses the input text and skips a whole multibyte sequence. The if checks whether this is the start of a multibyte sequence, and the loop inside it advances the input until it finds the next lead byte.

 #define SKIP_MULTI_BYTE_SEQUENCE(input) {            \
     if( (*(input++)) >= 0xc0 ) {                     \
         while( (*input & 0xc0) == 0x80 ){ input++; } \
     }                                                \
 }

substr and length are implemented using this macro.

 typedef unsigned char utf8; 

Substr

 void substr(const utf8 *string, int start, int len, utf8 **substring)
 {
     int bytes, i;
     const utf8 *str2;
     utf8 *output;

     --start;
     while( *string && start ) {
         SKIP_MULTI_BYTE_SEQUENCE(string);
         --start;
     }

     for(str2 = string; *str2 && len; len--) {
         SKIP_MULTI_BYTE_SEQUENCE(str2);
     }

     bytes = (int) (str2 - string);
     output = *substring;
     for(i = 0; i < bytes; i++) {
         *output++ = *string++;
     }
     *output = '\0';
 }

Length

 int length(const utf8 *string)
 {
     int len;

     len = 0;
     while( *string ) {
         ++len;
         SKIP_MULTI_BYTE_SEQUENCE(string);
     }
     return len;
 }
0

Source: https://habr.com/ru/post/1332923/

