How to change what PCRE regexp thinks is a newline in multiline mode?

With PCRE regular expressions in PHP, multi-line mode (/m) allows ^ and $ to match the beginning and end of lines (separated by newlines) in the source text, as well as the beginning and end of the source text itself.

This works fine on Linux, where \n (LF) is the line separator, but it fails on Windows with \r\n (CRLF).

Is there a way to change what PCRE treats as a newline? Or perhaps to let it match either CRLF or LF, in the same way that $ matches the end of a line or the end of the string?

Example:

    $EOL = "\n"; // Linux LF
    $SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
    if (preg_match('/^two$/m', $SOURCE_TEXT)) {
        echo 'Found match.'; // <<< RESULT
    } else {
        echo 'Did not find match!';
    }

RESULT: Success

    $EOL = "\r\n"; // Windows CR+LF
    $SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
    if (preg_match('/^two$/m', $SOURCE_TEXT)) {
        echo 'Found match.';
    } else {
        echo 'Did not find match!'; // <<< RESULT
    }

RESULT: Failure

+6
4 answers

Have you tried (*CRLF) and the related newline options? They are described on Wikipedia (under the Newline/linebreak options) and seem to do the right thing in my testing. That is, '/(*CRLF)^two$/m' should match with Windows \r\n newlines. (*ANYCRLF) should also match both Linux and Windows newlines, but I have not tested this.
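To make the suggestion concrete, here is a minimal sketch that runs both options against the sample text from the question. The variable names are only illustrative, and it assumes a PHP build whose PCRE library supports these newline options:

```php
<?php
// Sample texts with different line endings (illustrative test data).
$crlfText = "one\r\ntwo\r\nthree\r\nfour"; // Windows CR+LF
$lfText   = "one\ntwo\nthree\nfour";       // Linux LF

// (*CRLF) must appear at the very start of the pattern; it makes PCRE
// treat only the two-character sequence CR+LF as a line break.
$crlfOnCrlf = preg_match('/(*CRLF)^two$/m', $crlfText);

// (*ANYCRLF) accepts CR, LF, or CR+LF as a line break.
$anyOnCrlf = preg_match('/(*ANYCRLF)^two$/m', $crlfText);
$anyOnLf   = preg_match('/(*ANYCRLF)^two$/m', $lfText);

var_dump($crlfOnCrlf, $anyOnCrlf, $anyOnLf);
```

All three calls should report one match. Note that (*CRLF) alone would stop recognizing plain LF line breaks, so (*ANYCRLF) is probably the safer choice for input of mixed origin.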

+9

Note: This answer applies only to older versions of PHP. When I wrote it, I did not know about the available sequences and options: \R, (*BSR_ANYCRLF) and (*BSR_UNICODE). See also the answer to the question: How to replace various newline styles with PHP in the smartest way?

In PHP, it is not possible to specify the newline character for a PCRE pattern. The m modifier only looks for \n, which is documented. There is also no runtime setting to change this; that would be possible in Perl, but it is not an option in PHP.

Usually I just normalize the line endings in the subject string before using it with preg_match and the like:

 $subject = str_replace("\r\n", "\n", $subject); 

It may not be exactly what you are looking for, but it probably helps.
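If the text may also contain lone CR line endings, the same idea can be extended a little. This is only a sketch: normalize_newlines is a hypothetical helper name, not a PHP built-in, and the final match assumes PCRE's default LF newline convention:

```php
<?php
// Hypothetical helper: convert CR+LF and lone CR line endings to LF.
function normalize_newlines($text)
{
    // Order matters: CR+LF is replaced first, so that its CR is not
    // converted separately (which would produce two line breaks).
    return str_replace(array("\r\n", "\r"), "\n", $text);
}

$subject = normalize_newlines("one\r\ntwo\r\nthree\r\nfour");
$found   = preg_match('/^two$/m', $subject);

var_dump($subject === "one\ntwo\nthree\nfour", $found);
```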

Edit: Regarding the Windows EOL example you added to your question:

    $EOL = "\r\n"; // Windows CR+LF
    $SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
    if (preg_match('/^two$/m', $SOURCE_TEXT)) {
        echo 'Found match.';
    } else {
        echo 'Did not find match!'; // <<< RESULT
    }

This fails because the text has \r after two. So two is not at the end of the line: an additional \r character appears before the end of the line ($).

The PHP manual clearly explains that only \n is considered a character that indicates the end of a line. $ only considers \n, so if you are looking for two\r at the end of the line, you need to change your pattern. This is another option (instead of converting the text as suggested above).
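One way to change the pattern along those lines is to allow an optional \r in front of $. A minimal sketch, assuming PCRE's default LF newline convention (the variable names are only illustrative):

```php
<?php
$crlfText = "one\r\ntwo\r\nthree\r\nfour"; // Windows CR+LF
$lfText   = "one\ntwo\nthree\nfour";       // Linux LF

// \r? consumes the carriage return, if there is one, so that $ can
// then match directly in front of the \n.
$pattern = '/^two\r?$/m';

$crlfResult = preg_match($pattern, $crlfText);
$lfResult   = preg_match($pattern, $lfText);

var_dump($crlfResult, $lfResult);
```

The pattern matches in both cases, but on CRLF text the matched string includes the trailing \r, which may need trimming if the match is used afterwards.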

+5

This is strange. I would not have thought that $ (with the m modifier) cares whether there is \n or \r\n as a newline.

An idea to test this: add \s* in front of the $. \s also matches newline characters, so it would match the \r before the \n if that really is the problem.
As long as it is not a problem if there is additional whitespace at the end of the line, this should not hurt.
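A quick sketch of that test, again assuming the default LF newline convention. It captures the match so that you can see what \s* actually consumed:

```php
<?php
$crlfText = "one\r\ntwo\r\nthree\r\nfour"; // Windows CR+LF

// \s* is greedy, but it backtracks until $ can match in front of a \n,
// so here it ends up consuming only the \r.
$found = preg_match('/^two\s*$/m', $crlfText, $m);

// Show the match with control characters escaped.
var_dump($found, addcslashes($m[0], "\r\n"));
```

Keep in mind that \s also matches \n itself, so with blank lines after the target line the match can extend across several line breaks.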

+3

It all depends on where your data comes from: external and uncontrolled sources can provide pretty dirty data. Here is a hint for those of you who are trying to work around (or at least understand) the problem of correctly matching the end ($) of any line in multi-line mode (/m).

    <?php
    // Various operating systems use different end-of-line (line break) characters:
    // - Windows uses CR+LF (\r\n);
    // - Linux uses LF (\n);
    // - classic Mac OS uses CR (\r).
    // That is why the single dollar assertion ($) sometimes fails in multiline (/m) mode:
    // possibly a bug in PHP 5.3.8, or just a "feature"(?).
    $str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
    //          C          3                   p          0                   _
    $pat1='/\w$/mi';           // This works excellently in JavaScript (Firefox 7.0.1+)
    $pat2='/\w\r?$/mi';        // Slightly better
    $pat3='/\w\R?$/mi';        // Somewhat disappointing according to php.net and pcre.org when used improperly
    $pat4='/\w(?=\R)/i';       // Much better: a lookahead assertion (detects without capturing), without multiline (/m) mode;
                               // note that with an alternative for end of string ((?=\R|$)) it would grab all 7 elements as expected
    $pat5='/\w\v?$/mi';
    $pat6='/(*ANYCRLF)\w$/mi'; // Excellent, but undocumented on php.net at the moment (described on pcre.org and en.wikipedia.org)
    $n=preg_match_all($pat1, $str, $m1);
    $o=preg_match_all($pat2, $str, $m2);
    $p=preg_match_all($pat3, $str, $m3);
    $r=preg_match_all($pat4, $str, $m4);
    $s=preg_match_all($pat5, $str, $m5);
    $t=preg_match_all($pat6, $str, $m6);
    echo $str."\n1 !!! $pat1 ($n): ".print_r($m1[0], true)
        ."\n2 !!! $pat2 ($o): ".print_r($m2[0], true)
        ."\n3 !!! $pat3 ($p): ".print_r($m3[0], true)
        ."\n4 !!! $pat4 ($r): ".print_r($m4[0], true)
        ."\n5 !!! $pat5 ($s): ".print_r($m5[0], true)
        ."\n6 !!! $pat6 ($t): ".print_r($m6[0], true);
    // Note the difference among the three very helpful escape sequences in $pat2 (\r),
    // $pat3 and $pat4 (\R), $pat5 (\v), and the altered newline option in $pat6 ((*ANYCRLF)),
    // for some applications at least.

    /* The code above results in the following output:
    ABC ABC 123 123 def def nop nop 890 890 QRS QRS ~-_ ~-_
    1 !!! /\w$/mi (3): Array ( [0] => C [1] => 0 [2] => _ )
    2 !!! /\w\r?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
    3 !!! /\w\R?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
    4 !!! /\w(?=\R)/i (6): Array ( [0] => C [1] => 3 [2] => f [3] => p [4] => 0 [5] => S )
    5 !!! /\w\v?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
    6 !!! /(*ANYCRLF)\w$/mi (7): Array ( [0] => C [1] => 3 [2] => f [3] => p [4] => 0 [5] => S [6] => _ )
    */
    ?>

Unfortunately, I do not have access to a server with the latest version of PHP: my local PHP is 5.3.8, and my public PHP host runs version 5.2.17.
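The variant mentioned in the comment on $pat4, with an alternative for the end of the string, can be sketched as follows. It reuses the same test string; the names $pat7 and $count are only illustrative, and it assumes a PCRE version that supports \R:

```php
<?php
$str = "ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";

// A word character followed either by any line break sequence (\R)
// or by the end of the subject string. No /m modifier is needed.
$pat7 = '/\w(?=\R|$)/i';

$count = preg_match_all($pat7, $str, $m7);

echo "$pat7 ($count): " . implode(' ', $m7[0]) . "\n";
```

This finds the last word character of every line, whatever the line ending is, without relying on a newline option at the start of the pattern.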

0

Source: https://habr.com/ru/post/893484/

