Is it possible to use `strstr` to search for multibyte UTF-8 characters in a string?

Following up on my previous question, "Why does `strchr` seem to work with multi-byte characters, despite the man page disclaimer?", I realized that `strchr` was a bad choice.

Instead, I am thinking of using `strstr` to search for a single character (a multi-byte one, not a `char`):

    const char str[] = "This string contains é which is a multi-byte character";
    char *pos = strstr(str, "é"); // 'é' = 0xC3A9: 2 bytes
    printf("%s\n", pos);

Output:

    é which is a multi-byte character

This is what I expect: the position of the first byte of my multi-byte character.

A priori, this is not a canonical use of `strstr`, but it seems to work well.
Is this workaround safe? Can you think of any side effect or special case that might cause an error?

[EDIT]: To clarify, I do not want to use the `wchar_t` type, and the strings I process are encoded in UTF-8 (I know this choice is debatable, but that debate is irrelevant here).

+7
3 answers

Edit
Based on the OP's updated question, "can such a false positive exist in the context of UTF-8?": no. UTF-8 is designed so that it is not subject to the partial character match shown below. ASCII bytes, lead bytes and continuation bytes occupy disjoint ranges, so a valid UTF-8 needle can only match at a character boundary and cannot produce such false positives. Using `strstr` with multibyte characters encoded in UTF-8 is therefore completely safe.
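
To make the "disjoint ranges" argument concrete, here is a minimal sketch in C. The names `utf8_sequence_length` and `utf8_find` are hypothetical helpers introduced only for illustration, and the code assumes both strings are valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence introduced by `byte`, or 0 if
 * `byte` cannot start a character (continuation byte or invalid byte). */
static int utf8_sequence_length(unsigned char byte)
{
    if (byte <= 0x7F) return 1;                 /* ASCII */
    if (byte >= 0xC2 && byte <= 0xDF) return 2; /* lead byte, 2-byte sequence */
    if (byte >= 0xE0 && byte <= 0xEF) return 3; /* lead byte, 3-byte sequence */
    if (byte >= 0xF0 && byte <= 0xF4) return 4; /* lead byte, 4-byte sequence */
    return 0; /* 0x80..0xBF are continuation bytes; the rest never occur */
}

/* Hypothetical wrapper: a plain byte-wise strstr. The only extra step is a
 * precondition check that the needle itself starts on a character boundary;
 * if it does, every match in valid UTF-8 text does too. */
static const char *utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    assert(utf8_sequence_length((unsigned char)needle[0]) != 0);
    return strstr(haystack, needle);
}
```

For the string from the question, `utf8_find(str, "\xC3\xA9")` should return a pointer to the lead byte 0xC3 of é, because the continuation byte 0xA9 can never be mistaken for the start of another character.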

Original answer
No, `strstr` is not suitable for strings containing multibyte characters.

If you search for a string that contains no multibyte character inside a string that does contain one, you may get a false positive. (With Shift-JIS encoding in a Japanese locale, `strstr("掘something", "@some")` can give a false positive, because the second byte of 掘 is the same byte that encodes `@`.)

    +---------+----+----+----+
    |   c1    | c2 | c3 | c4 |   <--- string
    +---------+----+----+----+

         +----+----+----+
         | c5 | c2 | c3 |        <--- string to search
         +----+----+----+

If the trailing part of c1 happens to match c5, you will get a wrong result. I would suggest either using Unicode with a Unicode-aware substring search, or a multibyte-aware substring search (`_mbsstr`, for example).
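
A minimal sketch of that false positive, written with explicit byte escapes so it does not depend on the source file's encoding. It assumes, as the Shift-JIS example above implies, that 掘 is the byte pair 0x8C 0x40, and that the execution character set is ASCII-compatible so `@` is 0x40. The helper name `bytewise_offset` is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "掘something" as Shift-JIS bytes (assumption: the kanji is 0x8C 0x40). */
static const char sjis_text[] = "\x8C\x40" "something";

/* Hypothetical helper: byte offset of the first match found by strstr,
 * or -1 if there is none. It knows nothing about character boundaries. */
static ptrdiff_t bytewise_offset(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *pos = strstr(haystack, needle);
    return pos == NULL ? -1 : pos - haystack;
}
```

Here `bytewise_offset(sjis_text, "@some")` reports offset 1, which is the middle of the two-byte kanji: the text contains no `@` character at all, yet the byte-wise search finds one.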

+7

Modern systems use UTF-8 (or ASCII) as their multibyte encoding, where using this function is safe.

To be strictly conforming and make your code work even on old or exotic platforms, you need to consider some additional problems.

First, the good news: in every multibyte encoding, a zero byte indicates the end of the string, regardless of shift state. This means that your `strstr` call will not crash or read out of bounds, but the result may be wrong.

As an example, consider UTF-7, a 7-bit-clean encoding of Unicode. UTF-7 is a multibyte encoding with a shift state, which means that the interpretation of a byte may depend on the context in which it appears. For instance (see Wikipedia), "£1AKM" is encoded as `+AKM-1AKM` in UTF-7, where the `+` sign changes the state and the interpretation of letters such as `A`. Executing `strstr(str, "AKM")` will match the first `AKM` (after the `+`), although that is part of the encoding of £; it should actually match the `AKM` after the `-` (which resets the shift state to the initial state).
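
The difference can be sketched in C as follows. `utf7_find_direct` is a hypothetical, deliberately simplified search, not a real UTF-7 decoder: it assumes every shifted run opened by `+` is closed by an explicit `-` (UTF-7 also allows implicit termination), and that the needle is non-empty, consists of directly encoded characters, and contains no `+`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified sketch: find `needle` only in the unshifted
 * (directly encoded) parts of a UTF-7 string. Returns the byte offset of
 * the match, or -1 if there is none. */
static ptrdiff_t utf7_find_direct(const char *text, const char *needle)
{
    assert(text != NULL && needle != NULL && needle[0] != '\0');
    size_t needle_len = strlen(needle);
    int shifted = 0; /* nonzero while inside a +...- base64 run */

    for (size_t i = 0; text[i] != '\0'; i++) {
        if (shifted) {
            if (text[i] == '-')
                shifted = 0; /* explicit end of the shifted run */
            continue;        /* never match inside base64 data */
        }
        if (text[i] == '+') {
            shifted = 1;
            continue;
        }
        if (strncmp(text + i, needle, needle_len) == 0)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

On `"+AKM-1AKM"`, plain `strstr` with the needle `"AKM"` matches at offset 1 (inside the encoding of £), whereas this state-aware sketch skips the shifted run and reports offset 6.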

+1

Is this workaround safe? Can you think of any side effect or special case that might cause an error?

One side effect is that if `strstr()` does not find any match, it returns a null pointer. Passing that pointer to `printf` with `%s` is undefined behavior, which typically shows up as a segmentation fault.

You should check whether the pointer is NULL before printing the string. Check it as follows:

    if (pos == NULL)
        printf("letter not found");
    else
        printf("%s\n", pos);
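
The same check can be wrapped in a small function so the null case cannot be forgotten. This is only a sketch; `print_from_match` is a name made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print the tail of `haystack` starting at the first
 * occurrence of `needle`, or a message if there is none.
 * Returns 1 if a match was found, 0 otherwise. */
static int print_from_match(FILE *out, const char *haystack, const char *needle)
{
    assert(out != NULL && haystack != NULL && needle != NULL);
    const char *pos = strstr(haystack, needle);
    if (pos == NULL) {
        fputs("letter not found\n", out);
        return 0;
    }
    fprintf(out, "%s\n", pos);
    return 1;
}
```

Called as `print_from_match(stdout, str, "\xC3\xA9")`, it prints the string from é onward; with a needle that does not occur, it prints the fallback message instead of handing a null pointer to `printf`.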
-2

Source: https://habr.com/ru/post/1201428/

