Perl regular expression taints

Short version

In the code below, $1 is tainted, and I don't understand why.

Long version

I run Foswiki on a system with perl v5.14.2 and -T taint checking mode enabled. While debugging problems with this installation, I was able to create the following SSCCE. (Please note that I edited this post; the first version was longer and more complex, and the comments still refer to that version.)

    #!/usr/bin/perl -T
    use strict;
    use warnings;
    use locale;
    use Scalar::Util qw(tainted);
    my $var = "foo.bar_baz";
    $var =~ m/^(.*)[._](.*?)$/;
    print(tainted($1) ? "tainted\n" : "untainted\n");

Although the input string $var is untainted and the regular expression is fixed, the resulting capture group $1 is tainted. This seems very strange to me.

The perlsec manual has this to say about taint and regular expressions:

Values may be untainted by using them as keys in a hash; otherwise the only way to bypass the tainting mechanism is by referencing subpatterns from a regular expression match. Perl presumes that if you reference a substring using $1, $2, etc., that you knew what you were doing when you wrote the pattern.
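To make the quoted rule concrete, here is a minimal sketch of untainting through a capture group. It deliberately does not use `use locale`, so the locale rules discussed further down do not come into play. The helper name `untaint_name` and the allowed character set are assumptions for illustration, and the script is meant to be run as `perl -T script.pl`.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical validator: returns the captured copy of $name if it consists
# only of the allowed characters, or undef otherwise.
sub untaint_name {
    my ($name) = @_;
    return $name =~ /^([A-Za-z0-9._-]+)\z/ ? $1 : undef;
}

# $^X (the path of the perl binary) is tainted under -T; appending a
# zero-length slice of it taints the string without changing its content.
my $raw   = "foo.bar_baz" . substr($^X, 0, 0);
my $clean = untaint_name($raw);

printf "taint mode: %s\n", ${^TAINT} ? "on" : "off (run with perl -T)";
printf "raw:   %s\n", tainted($raw)   ? "tainted" : "untainted";
printf "clean: %s\n", tainted($clean) ? "tainted" : "untainted";
```

Without -T nothing is ever tainted, so both lines report "untainted" in that case.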

I would expect that even if the input were tainted, the output would still be untainted. Observing the opposite, tainted output from untainted input, feels like a strange bug in perl. But reading further in perlsec, it also points users to the SECURITY section of perllocale. There we read:

when use locale is in effect, Perl uses the tainting mechanism (see perlsec) to mark string results that become locale-dependent, and which may be untrustworthy in consequence. Here is a summary of the tainting behavior of operators and functions that may be affected by the locale:

  • Comparison operators ( lt , le , ge , gt and cmp ) [...]

  • Case-mapping interpolation (with \l , \L , \u or \U ) [...]

  • Matching operator ( m// ):

    Scalar true/false result never tainted.

    Subpatterns, either delivered as a list-context result or as $1 etc., are tainted if use locale (but not use locale ':not_characters' ) is in effect, and the subpattern regular expression contains \w (to match an alphanumeric character), \W (non-alphanumeric character), \s (whitespace character), or \S (non-whitespace character). The matched-pattern variables, $& , $` (pre-match), $' (post-match), and $+ (last match) are also tainted if use locale is in effect and the regular expression contains \w , \W , \s , or \S .

  • Substitution operator ( s/// ) [...]

[⋮]

This looks like it is meant to be an exhaustive list, and I don't see how it applies: my regex does not use any of \w , \W , \s or \S , so it should not be locale-dependent.

Can someone explain why this code taints the variable $1 ?

1 answer

Currently there is a discrepancy between the documentation quoted in the question and the actual implementation as of perl 5.18.1. The problem is character classes. The documentation mentions \w , \s , \W , \S in what sounds like an exhaustive list, while the implementation taints on almost every use of […] .

The correct solution probably lies somewhere in between: character classes such as [[:word:]] should taint, because they depend on the locale. My fixed list should not. Character ranges such as [a-z] depend on collation, so in my personal opinion they should taint as well. \d depends on what the locale considers a digit, so it should taint too, even though it is neither one of the escape sequences mentioned so far nor a bracketed class.
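To see what a particular perl actually does with each of these constructs, something like the following probe could be used. It is only a sketch: the helper name `capture_is_tainted` and the subject string (the question's string with a digit appended so that \d can match) are assumptions for illustration. The results depend on the perl version, so no expected output is given. Run it as `perl -T script.pl`.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper: match a fixed, untainted string against a pattern
# built around $separator while "use locale" is in effect, and report
# whether $1 is tainted (1), untainted (0), or undef if nothing matched.
sub capture_is_tainted {
    my ($separator) = @_;
    use locale;    # lexically scoped: applies to the match below
    my $subject = "foo.bar_baz2";
    if ($subject =~ m/^(.*)$separator(.*?)$/) {
        return tainted($1) ? 1 : 0;
    }
    return;
}

print "taint mode is off, run with perl -T\n" unless ${^TAINT};

for my $separator ('[._]', '[[:word:]]', '[a-z]', '\d', '\W') {
    my $result = capture_is_tainted($separator);
    printf "%-12s %s\n", $separator,
        !defined $result ? "no match"
      : $result          ? "tainted"
      :                    "untainted";
}
```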

So, in my opinion, both the documentation and the implementation need fixing. The Perl developers are working on this. See the perl bug report I filed for progress information.

For a fixed list of characters, one viable workaround is to write it as an alternation, i.e. (?:\.|_) instead of [._] . It is more verbose, but should work even with current (in my opinion, buggy) versions of perl.
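Applied to the script from the question, the workaround might look like the sketch below. The helper name `split_at_last_separator` is an assumption for illustration. Whether $1 comes out untainted still depends on the perl version in use, so it is worth checking with `perl -T script.pl` on the target system.

```perl
use strict;
use warnings;
use locale;
use Scalar::Util qw(tainted);

# Hypothetical helper: split at the last "." or "_" using an alternation
# in place of the bracketed class [._].
sub split_at_last_separator {
    my ($var) = @_;
    return $var =~ m/^(.*)(?:\.|_)(.*?)$/ ? ($1, $2) : ();
}

my ($head, $tail) = split_at_last_separator("foo.bar_baz");
print "head=$head tail=$tail\n";    # prints "head=foo.bar tail=baz"
print(tainted($head) ? "tainted\n" : "untainted\n");
```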


Source: https://habr.com/ru/post/959250/

