Does encoding affect the result of strstr() (and related functions)?

Does character encoding affect the result of strstr()?

For example, I read data into buf and do the following:

char *p = strstr(buf, "UNB");

I wonder whether the encoding of the data, ASCII or something else (for example, EBCDIC), affects the result of this function? (Since "UNB" is a different byte sequence under different encodings ...)

If so, which encoding do these functions assume by default? (ASCII?)

Thanks!

0
5 answers

C functions such as strstr work on raw char data, regardless of encoding. In this case, you potentially have two different encodings: the one the compiler uses for the string literal, and the one in effect when buf was filled. If they are not the same, the function may not work as expected.
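To make the mismatch concrete, here is a minimal sketch. It assumes the compiler encodes string literals in an ASCII-derived character set and that the buffer holds "UNB" as EBCDIC code page 037 bytes; the names ebcdic_buf and contains are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* "UNB" as it would arrive from an EBCDIC (code page 037) source:
   U = 0xE4, N = 0xD5, B = 0xC2. The trailing 0 terminates the string. */
const char ebcdic_buf[] = { (char)0xE4, (char)0xD5, (char)0xC2, 0 };

/* Returns 1 if needle occurs in haystack, 0 otherwise.
   strstr only compares byte values; it knows nothing about the
   encoding of either argument. */
int contains(const char *haystack, const char *needle)
{
    return strstr(haystack, needle) != NULL;
}
```

Under those assumptions, contains(ebcdic_buf, "UNB") is 0 because the literal is stored as the bytes 0x55 0x4E 0x42, while contains(ebcdic_buf, "\xE4\xD5\xC2") is 1 because the needle is spelled with the same bytes the buffer holds.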

As for the default encoding, there isn't one, at least as far as the standard is concerned; the "basic execution character set" is implementation-defined. In practice, systems that do not use an encoding derived from ASCII (ISO 8859-1 seems to be the most common, at least here in Europe) are extremely rare. As for the encoding you get in buf, that depends on where the characters come from; if you are reading from an istream, it depends on the locale imbued in the stream. In practice, however, again, almost all of these (UTF-8, ISO 8859-x, etc.) are derived from ASCII and are identical to ASCII for all characters in the basic character set (which includes all of the characters legal in traditional C). Thus, for "UNB", you are probably safe. (But for something like "üéâ", you almost certainly are not.)
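A small sketch of that difference, assuming an ASCII-based compiler: the same text is written out byte by byte in ISO 8859-1 and in UTF-8, so the example does not depend on the encoding of the source file. The names latin1_text, utf8_text and found are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* The text "UNB über" in two ASCII-derived encodings.
   ü is the single byte 0xFC in ISO 8859-1 and the two bytes
   0xC3 0xBC in UTF-8. The literals are split so that the hex
   escape does not swallow the following 'b' and 'e'. */
const char latin1_text[] = "UNB \xFC" "ber";
const char utf8_text[]   = "UNB \xC3\xBC" "ber";

/* Returns 1 if needle occurs in haystack as a byte sequence. */
int found(const char *haystack, const char *needle)
{
    return strstr(haystack, needle) != NULL;
}
```

Here found(latin1_text, "UNB") and found(utf8_text, "UNB") are both 1, since the ASCII range is shared. A UTF-8 needle for ü, "\xC3\xBC", is found in utf8_text but not in latin1_text.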

+3

The string constant ("UNB") is encoded in the encoding of the source file, so it must match the encoding of your buffer.

+3

Both string arguments must have the same encoding. For string literals, that is the C++ source encoding (platform encoding). With Unicode in UTF-8 there is a further problem: Unicode has precomposed letters with diacritics, but the same letters can also be encoded as a base letter plus a combining diacritical mark. é can be one code point [é] or two: [e] + [combining ´]. Normalization exists to deal with this.
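The normalization issue can be sketched as follows, using explicit UTF-8 bytes so the source-file encoding does not matter. The names cafe_nfc, cafe_nfd and has_precomposed_e_acute are hypothetical, introduced only for this example.

```c
#include <stddef.h>
#include <string.h>

/* "café" in UTF-8, in two canonically equivalent forms. */
const char cafe_nfc[] = "caf\xC3\xA9";   /* é as U+00E9, precomposed */
const char cafe_nfd[] = "cafe\xCC\x81";  /* e followed by U+0301, combining acute */

/* Returns 1 if text contains the precomposed é (bytes 0xC3 0xA9). */
int has_precomposed_e_acute(const char *text)
{
    return strstr(text, "\xC3\xA9") != NULL;
}
```

Both strings display as the same word, yet has_precomposed_e_acute(cafe_nfc) is 1 and has_precomposed_e_acute(cafe_nfd) is 0. Normalizing both inputs to the same form before searching would avoid this; standard C has no function for that, so it would need an external Unicode library.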

For Java, it has become common practice (a very quiet development) to explicitly set the source encoding to UTF-8. For C++ projects, I don't know of any widespread convention.

+1

strstr should work without problems on UTF-8-encoded Unicode text.
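A minimal sketch of why this works, assuming both arguments are valid UTF-8: the encoding of one character never appears inside the encoding of another, so a byte-wise match always lands on a character boundary. The helper name utf8_find is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Returns the byte offset of the first occurrence of needle in
   haystack, or -1 if it does not occur. Both arguments are assumed
   to be valid UTF-8. The offset counts bytes, not characters. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1L;
}
```

For the haystack "na\xC3\xAFve caf\xC3\xA9" (naïve café) and the needle "caf\xC3\xA9", the result is 7 even though the match starts at the seventh character (index 6), because ï occupies two bytes.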

0

With this function, the data is encoded in ASCII.

-1

Source: https://habr.com/ru/post/1201436/

