Regular expressions - C# behaves differently than Perl / Python

In Python:

    ttsiod@elrond:~$ python
    >>> import re
    >>> a='This is a test'
    >>> re.sub(r'(.*)', 'George', a)
    'George'

In Perl:

    ttsiod@elrond:~$ perl
    $a="This is a test";
    $a=~s/(.*)/George/;
    print $a;
    (Ctrl-D)
    George

In C#:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Threading;
    using System.Text.RegularExpressions;

    namespace IsThisACsharpBug
    {
        class Program
        {
            static void Main(string[] args)
            {
                var matchPattern = "(.*)";
                var replacePattern = "George";
                var newValue = Regex.Replace("This is nice", matchPattern, replacePattern);
                Console.WriteLine(newValue);
            }
        }
    }

Unfortunately, C# prints:

    $ csc regexp.cs
    Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.5420
    for Microsoft (R) .NET Framework version 3.5
    Copyright (C) Microsoft Corporation. All rights reserved.

    $ ./regexp.exe
    GeorgeGeorge

Is this a bug in the C# regex library? Why does it print George twice when Perl and Python print it just once?

+6
4 answers

In your example, the difference seems to be in the semantics of the "replace" function, and not in the regular expression processing itself.

.NET performs a "global" replacement, i.e. it replaces all matches, not just the first one.

Global Substitution in Perl

(note the small "g" at the end of the =~ s line)

    $a="This is a test";
    $a=~s/(.*)/George/g;
    print $a;

which produces

 GeorgeGeorge 

Single Replace in .NET

    var re = new Regex("(.*)");
    var replacePattern = "George";
    var newValue = re.Replace("This is nice", replacePattern, 1);
    Console.WriteLine(newValue);

which produces

 George 

since it stops after the first replacement.
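For comparison, Python's re.sub accepts a similar limit through its count argument. This is only a small sketch of the same idea in Python, not a statement about how either library is implemented; the variable names are mine.

```python
import re

text = "This is nice"

# count=1 stops after the first replacement, much like
# re.Replace(input, replacement, 1) in the .NET snippet above.
first_only = re.sub(r"(.*)", "George", text, count=1)

print(first_only)  # George
```

Passing count by keyword avoids confusing it with the flags argument.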

+5

It is not clear to me whether this is a bug or not, but if you change .* to .+ it will do what you want. I suspect that (.*) also matches an empty string, which confuses things.

This is supported by the following code:

    using System;
    using System.Text.RegularExpressions;

    class Test
    {
        static void Main()
        {
            // Walk through every match and print its length.
            var match = Regex.Match("abc", "(.*)");
            while (match.Success)
            {
                Console.WriteLine(match.Length);
                match = match.NextMatch();
            }
        }
    }

This prints 3, then 0. Changing the pattern to "(.+)" makes it print just 3.
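The same check can be sketched in Python, whose re module also reports a second, empty match at the end of the input. The helper name match_spans is hypothetical, introduced only for this illustration.

```python
import re

def match_spans(pattern, text):
    """Return the (start, end) span of every match, scanning left to right."""
    return [m.span() for m in re.finditer(pattern, text)]

# .* first consumes "abc", then matches the empty string at the very end.
print(match_spans(r"(.*)", "abc"))  # [(0, 3), (3, 3)]

# .+ needs at least one character, so the empty match disappears.
print(match_spans(r"(.+)", "abc"))  # [(0, 3)]
```

A span such as (3, 3) has length 0, which corresponds to the second value printed by the .NET program.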

It should be noted that this has nothing to do with C# as a language, only with the standard .NET libraries. It is worth distinguishing between the language and the libraries: for example, you will get exactly the same behavior if you use the standard .NET library from F#, VB, C++/CLI, etc.

+2

Replacing "" yields "George" (.* matches "")

and

 "This is a start" == "This is a start" + "" 

So the regular expression matches "This is a start" and replaces it with "George". Now its "cursor" is at the end of the string, where it again tries to match the remaining string ("") against the pattern. It finds a match, so it adds a second "George". I don't know if that is right or wrong.

I will add that the JavaScript engine seems to do the same (tested here: http://www.regular-expressions.info/javascriptexample.html ) in IE and Chrome.
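The "cursor" description can be sketched as a hand-written replace-all loop in Python. This is an illustration under my own assumptions, not how .NET actually implements Regex.Replace; replace_all is a hypothetical helper, and the one-character step after an empty match is just one common way to guarantee progress.

```python
import re

def replace_all(pattern, replacement, text):
    """Hypothetical helper: replace every match, including a trailing empty one."""
    regex = re.compile(pattern)
    pieces = []
    cursor = 0
    while cursor <= len(text):
        match = regex.search(text, cursor)
        if match is None:
            break
        pieces.append(text[cursor:match.start()])  # unmatched text before the match
        pieces.append(replacement)
        if match.end() > match.start():
            cursor = match.end()  # continue right after a non-empty match
        else:
            # Empty match: copy one character and step past it,
            # otherwise the loop would never advance.
            pieces.append(text[match.start():match.start() + 1])
            cursor = match.start() + 1
    pieces.append(text[cursor:])  # whatever is left after the last match
    return "".join(pieces)

print(replace_all(r"(.*)", "George", "This is a start"))  # GeorgeGeorge
print(replace_all(r"(.+)", "George", "This is a start"))  # George
```

With (.*) the loop finds the whole string first and then an empty match at the end, which is where the second "George" comes from. As far as I know, Python's own re.sub has also behaved this way since version 3.7; older versions skipped an empty match adjacent to the previous one, which is consistent with the single George shown in the question.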

+2

Is this a bug in the C# regex library?

Perhaps, but that does not really address the title of your question:

Regular expressions - C# behaves differently than Perl / Python

Different regular expression engines and implementations behave differently. In some cases this is explicit, and includes support for different elements and syntax: for example, some flavors use \( and \) for grouping, whereas most use plain parentheses for grouping and the backslash-escaped forms for literal parentheses.

The book Mastering Regular Expressions (Jeffrey E. F. Friedl, O'Reilly) spends a lot of time explaining these differences, in addition to the more fundamental differences between non-deterministic finite automaton (NFA) and deterministic finite automaton (DFA) approaches.

P.S. As others have noted, .* matches an empty string, so first the whole of your input string is matched and replaced, then the empty string at the end of the input is matched and replaced. If you want to match the whole, but possibly empty, input, include anchors for start and end: ^(.*)$ .
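A quick way to see the effect of the anchors is this small Python sketch (the variable names are mine). Without the multiline flag, ^ can only match at the very start of the input, so the empty match at the end is no longer possible.

```python
import re

anchored = r"^(.*)$"

# The whole input is matched once; nothing else can match after it.
whole = re.sub(anchored, "George", "This is a test")
print(whole)  # George

# An empty input still matches exactly once.
empty = re.sub(anchored, "George", "")
print(empty)  # George
```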

+2

Source: https://habr.com/ru/post/896299/

