Ambiguous language in the specification of strtol et al.

The specification for strtol conceptually divides the input string into "initial white space", a "subject sequence" and a "final string", and defines the "subject sequence" as:

The longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is empty or consists entirely of white-space characters, or if the first non-white-space character is other than a sign or a permissible letter or digit.

At one time I thought that the "longest initial subsequence" business was akin to how scanf works, where "0x@" would scan as "0x", a failed match, followed by "@" as the next unread character. However, after some discussion, I'm mostly convinced that strtol processes the longest initial subsequence that is of the expected form, not the longest initial string that is an initial subsequence of some possible string of the expected form.

What still confuses me is this language in the spec:

If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of str is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

If we accept what seems to be the correct definition of "subject sequence", there is no such thing as a non-empty subject sequence that does not have the expected form, and instead (to avoid redundancy and confusion) the text should simply read:

If the subject sequence is empty, no conversion is performed; the value of str is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
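For reference, the "no conversion" case is easy to observe from the caller's side. The following is only a minimal sketch; the helper name no_conversion is hypothetical and not part of any standard API:

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if strtol performed no conversion on s,
   which the caller can detect because the original pointer s itself
   (not a pointer past the white space) is stored through endptr. */
static int no_conversion(const char *s, int base)
{
    char *end;
    (void)strtol(s, &end, base);
    return end == s;
}
```

With a conforming library, no_conversion("", 10), no_conversion("   ", 10), no_conversion("@12", 10) and no_conversion("  +", 10) should all report an empty subject sequence, while no_conversion(" 12", 10) should not.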

Can someone clarify these issues for me? Perhaps a reference to past discussions or any relevant defect reports would be helpful.

+6
4 answers

I think the C99 language is pretty clear:

The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form.

Given "0x@": "0x@" is not of the expected form; "0x" is not of the expected form; therefore "0" is the longest initial subsequence that is of the expected form.

I agree that this implies you cannot have a non-empty subject sequence that is not of the expected form - unless you interpret the following:

In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.

... as allowing the locale to define other possible forms that the subject sequence may take, which are nonetheless not of the "expected form".

The wording in the final paragraph appears to be simply "belt and braces".
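That reading can be checked by looking at where endptr lands. This is a rough sketch, not an authoritative test; the helper name consumed is made up for illustration:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters of s (leading white space
   included) that strtol consumed before it stopped. */
static size_t consumed(const char *s, int base)
{
    char *end;
    (void)strtol(s, &end, base);
    return (size_t)(end - s);
}
```

Under the interpretation above, consumed(" 0x@ ", 16) should be 2 (one space plus the "0"), leaving endptr at the 'x', whereas consumed("0x1f@", 16) should be 4 because the whole "0x1f" is of the expected form.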

+3

This might be easier to understand if you started in §7.20.1.4 (strtol, strtoll, strtoul and strtoull functions) ¶2 of the C99 standard instead of ¶4:

¶2 The strtol, strtoll, strtoul, and strtoull functions convert the initial portion of the string pointed to by nptr to long int, long long int, unsigned long int, and unsigned long long int representation, respectively. First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string. Then, they attempt to convert the subject sequence to an integer, and return the result.

¶3 If the value of base is zero, the expected form of the subject sequence is that of an integer constant as described in 6.4.4.1, optionally preceded by a plus or minus sign, but not including an integer suffix. If the value of base is between 2 and 36 (inclusive), the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign, but not including an integer suffix. The letters from a (or A) through z (or Z) are ascribed the values 10 through 35; only letters and digits whose ascribed values are less than that of base are permitted. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.

¶4 The subject sequence is defined as the longest initial subsequence of the input string, ...

In particular, ¶3 spells out what the expected form of the subject sequence is.
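A few of the rules in ¶3 can be illustrated with a small wrapper. This is only a sketch; parse is a hypothetical name, and the behaviour described is what the quoted text requires rather than a guarantee about any particular libc:

```c
#include <stdlib.h>

/* Hypothetical helper: convert s in the given base and, if rest is
   non-null, report where the conversion stopped. */
static long parse(const char *s, int base, const char **rest)
{
    char *end;
    long v = strtol(s, &end, base);
    if (rest != NULL)
        *rest = end;
    return v;
}
```

With base 0 the prefix selects the radix, so parse("0x1F", 0, NULL) should give 31 and parse("017", 0, NULL) should give 15. With base 36, "z" has the value 35. With base 2, "129" stops at the '2' because only digits whose value is less than the base are permitted, and "10u" stops at the 'u' because an integer suffix is not part of the expected form.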

+2

The POSIX specification for strtol seems clearer:

These functions shall convert the initial portion of the string pointed to by str to a type long and long long representation, respectively. First, they decompose the input string into three parts:

  • An initial, possibly empty, sequence of white-space characters (as specified by isspace())

  • A subject sequence interpreted as an integer represented in some radix determined by the value of base

  • A final string of one or more unrecognized characters, including the terminating NUL character of the input string.

Then they shall attempt to convert the subject sequence to an integer, and return the result.

But, of course, it is not normative and "defers to the ISO C standard."

+1

I completely agree with your assessment: by definition, all non-empty subject sequences are of the expected form, so the wording of the standard is questionable.

In the case of the floating-point conversion functions, there is one more error (C99:TC3, section 7.20.1.3, §3):

[...] The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is not of the expected form.

This implies that the whole input string has to be of the expected form, which defeats the purpose of the endptr parameter. It can be argued that the expected form for the input string is different from the expected form for the subject sequence, but it is still rather confusing.

You are also correct that the semantics of the strto*() and *scanf() functions differ: when they do match, they will always agree on the value and consume the same number of characters (and any libc implementation where they don't is broken, including newlib and glibc last time I checked), but *scanf() will fail to match in cases where it would need to back up more than one character, as in your examples "0x@" and "1.0e+".
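In practice implementations do not read the sentence that literally. A minimal sketch of what one would expect to see, with consumed_by_strtod being a hypothetical helper name:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert s with strtod, store the value in *out,
   and return how many characters were consumed. */
static size_t consumed_by_strtod(const char *s, double *out)
{
    char *end;
    *out = strtod(s, &end);
    return (size_t)(end - s);
}
```

For "1.5abc" the input string as a whole is clearly not of the expected form, yet the call should still yield 1.5 with 3 characters consumed. Likewise "1.0e+" should yield 1.0 with endptr left at the 'e', since an exponent part needs at least one digit.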
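To compare the two families on a given input, something along these lines could be used. It is only a sketch: compare_parsers is a hypothetical helper, and the sscanf result for inputs such as "0x@" is deliberately left unstated because it has varied between library versions:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: run both parsers on the same text.
   Returns sscanf's result; *strtol_used receives the number of characters
   strtol consumed, *scanf_used the %n count, or -1 if %n was never reached. */
static int compare_parsers(const char *s, long *strtol_used, int *scanf_used)
{
    char *end;
    long v = 0;

    (void)strtol(s, &end, 0);
    *strtol_used = (long)(end - s);

    *scanf_used = -1;
    return sscanf(s, "%li%n", &v, scanf_used);
}
```

For a well-formed input such as "0x1f" both counts should be 4 and sscanf should return 1. For "0x@", strtol should report 1 consumed character; whether sscanf reports a matching failure (return value 0) is exactly the point where the two are allowed to differ.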

+1

Source: https://habr.com/ru/post/892799/

