Regular expression matching any subset of a given set?

Is it possible to write a regular expression that matches any subset of a given character set
a1 ... an ?
That is, it must match any string where any of these characters appears no more than once, there are no other characters, and the relative order of the characters does not matter.

Some approaches that arise immediately:
1. [a1,...,an]* or (a1|a2|...|an)* - this allows the multiple presence of characters
2. (a1?a2?...an?) - there is no multiple presence, but the relative order is important - this corresponds to any subsequence, but not a subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1) , i.e. Writing down all possible subsequences (just hardcode all the corresponding lines :)) is, of course, unacceptable.

I also suggest that this may be theoretically impossible, because during parsing, we will need to remember which character we have seen before, and as far as I know, regular expressions can only check the correct linear languages.

Any help would be appreciated. Thanks in advance.

+3
source share
4 answers

It is impossible to think about how to do this with a single regex, but this is one way to do it with n regexes: (I will use 1 2 ... m n etc. for your a s)

 ^[23..n]*1?[23..n]*$ ^[13..n]*2?[13..n]*$ ... ^[12..m]*n?[12..m]*$ 

If all of the above matches, your string is a strict subset of 12..mn .

How it works: each line requires that the line consist of exactly:

  • any number of characters taken from the set except a particular one
  • maybe a particular one
  • any number of characters taken from the set except a particular one

If this passes when each element in turn is considered a particular one , we know:

  • there is nothing in the line except allowed items
  • no more than one of the valid elements exists

as needed.


for completeness, I must say that I would only do this if I were ordered to "use regular expression"; if not, I would keep track of which valid elements were noticed, and iterate over the characters of the string, doing the obvious thing.

+3
source

It really doesn't match the language-agnostic , but ...

 ^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$ 

see demo at ideone.com

The first time an element is matched, it gets "marked" by the capture group following it. Since the group was now participating in the match, a negative result for its corresponding backreference (for example, (?!\1) ) will never match again, although the group only fixed an empty line. This is an undocumented function, which, however, is supported in many ways, including Java, .NET, Perl, Python, and Ruby.

This solution also requires support for direct links (i.e. links to a given capture group ( \1 ) that appears in the regular expression before the group itself). This seems to be slightly less widely supported than the trick with empty groups.

+4
source

Not sure if you can get extended regex for this, but it's pretty easy to do with a simple traversal of your string.

You use a hash (or an array or something else) to store if any of your allowed characters have already been seen or not in the string. Then you just iterate over the elements of your string. If you encounter an item not in your permitted set, you will help out. If it is allowed, but you have already seen it, you will help out too.

In pseudo code:

 foreach char a in {a1, ..., an} hit[a1] = false foreach char c in string if c not in {a1, ..., an} => fail if hit[c] => fail hit[c] = true 
+1
source

Like Alan Moore, using only \ 1 and does not belong to the capture group before he was spotted:

 #!/usr/bin/perl my $re = qr/^(?:([abc])(?!.*\1))*$/; foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') { print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n"; } 

We match any number of blocks (external (? :)), where each block must consist of "exactly one character from our preferred set, which is not followed by a line containing that character".

If a string can contain newline characters or other fun things, you may need to play with some flags to make ^, $ and. behave as intended, but it all depends on the specific taste of RE.

Just for stupidity, you can effectively use the positive forward statement and two regular expressions, so we can test any permutation abc, claiming that the above matches, followed by the usual check for ', are N characters long and consists of these characters:

 my $re2 = qr/^(?=$re)[abc]{3}$/; foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') { print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n"; } 
0
source

Source: https://habr.com/ru/post/901164/


All Articles