Short version
In the code below, $1 is tainted, and I don't understand why.
Long version
I run Foswiki on a system with perl v5.14.2 with -T taint check mode enabled. While debugging a problem with this setup, I managed to construct the following SSCCE. (Please note that I edited this post; the first version was longer and more complicated, and the comments still refer to it.)
    #!/usr/bin/perl -T
    use strict;
    use warnings;
    use locale;
    use Scalar::Util qw(tainted);
    my $var = "foo.bar_baz";
    $var =~ m/^(.*)[._](.*?)$/;
    print(tainted($1) ? "tainted\n" : "untainted\n");
Although the input string $var is untainted and the regular expression is fixed, the resulting capture group $1 is tainted, which seems very strange to me.
The perlsec manual has this to say about taint and regular expressions:
Values may be untainted by using them as keys in a hash; otherwise the only way to bypass the tainting mechanism is by referencing subpatterns from a regular expression match. Perl presumes that if you reference a substring using $1, $2, etc., that you knew what you were doing when you wrote the pattern.
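The rule quoted above can be sketched as follows. This is only a minimal illustration, assuming the script is run with perl -T and that PATH is set in the environment; the helper name untaint_via_capture and the whitelist pattern are made up for this example and are not part of Foswiki or perlsec. No use locale is in effect here.

```perl
#!/usr/bin/perl
# Run as: perl -T untaint_sketch.pl   (hypothetical file name)
use strict;
use warnings;
use Scalar::Util qw(tainted);

warn "taint mode is off, so nothing will be reported as tainted\n"
    unless ${^TAINT};

# Hypothetical helper: return the captured text if $data consists only of
# characters from a fixed whitelist, or nothing if it does not match.
# No "use locale" is in effect anywhere in this file.
sub untaint_via_capture {
    my ($data) = @_;
    if ($data =~ m{^([-A-Za-z0-9_./:]*)$}) {
        return $1;
    }
    return;
}

# Environment variables are tainted under -T (assuming PATH is set at all).
my $input = defined $ENV{PATH} ? $ENV{PATH} : '';
my $clean = untaint_via_capture($input);

print "input:   ", (tainted($input) ? "tainted" : "untainted"), "\n";
if (defined $clean) {
    print "capture: ", (tainted($clean) ? "tainted" : "untainted"), "\n";
} else {
    print "capture: pattern did not match\n";
}
```

If PATH contains characters outside the whitelist, the pattern simply does not match and the script says so instead of reporting a taint state.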
I would assume that even if the input were tainted, the output would still be untainted. To observe the reverse, tainted output from untainted input, feels like a strange bug in perl. But if you read more of perlsec, it also points users to the SECURITY section of perllocale. There we read:
when use locale is in effect, Perl uses the tainting mechanism (see perlsec) to mark string results that become locale-dependent, and which may be untrustworthy in consequence. Here is a summary of the tainting behavior of operators and functions that may be affected by the locale:
Comparison operators (lt, le, ge, gt and cmp) [...]
Case-mapping interpolation (with \l, \L, \u or \U) [...]
Matching operator (m//):
Scalar true/false result never tainted.
Subpatterns, either delivered as a list-context result or as $1 etc., are tainted if use locale (but not use locale ':not_characters') is in effect, and the subpattern regular expression contains \w (to match an alphanumeric character), \W (non-alphanumeric character), \s (whitespace character), or \S (non-whitespace character). The matched-pattern variable, $&, $` (pre-match), $' (post-match), and $+ (last match) are also tainted if use locale is in effect and the regular expression contains \w, \W, \s, or \S.
Substitution operator (s///) [...]
[⋮]
This looks like it should be an exhaustive list. And I don't see how it applies: my regex does not use any of \w, \W, \s or \S, so it should not be locale-dependent.
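One way to check whether use locale is really what makes the difference is to run the same match in two lexical scopes, one with the pragma and one without. The following is only a sketch of that comparison, assuming it is run with perl -T; the two sub names are made up for this example, and the result of the second line is deliberately not stated here, since it is exactly what is in question.

```perl
#!/usr/bin/perl
# Run as: perl -T compare_locale.pl   (hypothetical file name)
use strict;
use warnings;
use Scalar::Util qw(tainted);

warn "taint mode is off, so nothing will be reported as tainted\n"
    unless ${^TAINT};

# Same pattern as in the test case, compiled WITHOUT "use locale".
sub capture_tainted_plain {
    my ($str) = @_;
    return unless $str =~ m/^(.*)[._](.*?)$/;
    return tainted($1) ? 1 : 0;
}

# Same pattern, compiled WITH "use locale"; the pragma is lexically
# scoped, so it only affects the rest of this sub body.
sub capture_tainted_locale {
    my ($str) = @_;
    use locale;
    return unless $str =~ m/^(.*)[._](.*?)$/;
    return tainted($1) ? 1 : 0;
}

my $var = "foo.bar_baz";    # untainted literal, as in the test case
printf "without locale: %s\n",
    capture_tainted_plain($var) ? "tainted" : "untainted";
printf "with locale:    %s\n",
    capture_tainted_locale($var) ? "tainted" : "untainted";
```

If only the second line reports "tainted", the pragma is the trigger even though the pattern contains none of the escapes listed in perllocale.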
Can someone explain why this code taints the variable $1?