Regex group in Perl: how to capture elements into an array from a regex group that matches an unknown number / multiple / variable occurrences from a string?

Question

In Perl, how can I use one regex group to capture every occurrence that matches it, with each occurrence stored in its own array element?

For example, for the string:

    var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello

processed with this code:

    $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

    my @array = $string =~ <regular expression here>;

    for ( my $i = 0; $i < scalar( @array ); $i++ )
    {
      print $i.": ".$array[$i]."\n";
    }

I would like to see as output:

    0: var1=100
    1: var2=90
    2: var5=hello
    3: var3="a, b, c"
    4: var7=test
    5: var3=hello

What would I use as the regular expression?

The commonality between the things I want to match here is the assignment string pattern, so something like:

 my @array = $string =~ m/(\w+=[\w\"\,\s]+)*/;

Here * is meant to indicate one or more occurrences matching the group (strictly speaking, * means zero or more).

(I ruled out split() because some matches contain spaces within themselves (e.g. var3 ...), so it would not give the desired results.)

With the above expression, I get:

 0: var1=100 var2

Is this possible with a regular expression, or is additional code required?

I have already looked at existing answers when searching for "perl regex multiple group", but did not find enough clues:

Work with multiple capture groups in multiple records
Multiple matches in a regex group?
Regex: recurring capture groups
Matching and grouping regular expressions
How to match a regular expression with a grouping with an unknown number of groups
awk extract multiple groups from each row
Combining and removing multiple regex groups
Perl: delete multiple abstract lines where a specific criterion is met
Match regular expressions across multiple groups per line?
PHP RegEx Group Multiple Matches
How to find multiple occurrences with regex groups?

+42

regex perl match grouping

therobyouknow Aug 11 '10 at 2:59 a.m.

9 answers

With regular expressions, use a technique that I like to call tack-and-stretch: anchor on features you know will be there (tack), then grab what lies between them (stretch).

In this case, you know that a single assignment matches

 \b\w+=.+

and you have many of these repeated in $string. Remember that \b means word boundary:

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in any order), counting the imaginary characters off the beginning and end of the string as matching a \W.

The values in the assignments can be a little tricky to describe with a regular expression, but you also know that each value ends with whitespace (although not necessarily the first whitespace encountered!) followed by either another assignment or the end of the string.

To avoid repeating the assignment pattern, compile it once with qr// and reuse it in your pattern along with a look-ahead assertion (?=...) to stretch the match just far enough to capture the entire value while keeping it from spilling into the next variable name.

Matching against your pattern in list context with m//g gives the following behavior:

The /g modifier specifies global pattern matching, that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
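As a minimal sketch of the list-context behavior quoted above, the following compares a pattern without capturing parentheses to one with two capture groups. The sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'a=1 b=2 c=3';    # illustrative input, not from the question

# No capturing parentheses: each list element is one whole match.
my @whole = $text =~ /\w=\d/g;

# Two capture groups: the captures of every match are returned as one flat list.
my @pairs = $text =~ /(\w)=(\d)/g;

print "whole: @whole\n";
print "pairs: @pairs\n";
```

This is why a pattern meant to return whole assignments should either contain no capturing parentheses or wrap the entire assignment in exactly one group.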

$assignment uses non-greedy .+? to cut off the value as soon as the look-ahead sees another assignment or the end of the line. Remember that the match returns the substrings from all capturing subpatterns, so the look-ahead's alternation uses non-capturing (?:...). Interpolating the qr// object adds no capturing group either, so the pattern has no capturing parentheses at all and each list element is a whole matched assignment.

    #! /usr/bin/perl

    use warnings;
    use strict;

    my $string = <<'EOF';
    var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello
    EOF

    my $assignment = qr/\b\w+ = .+?/x;
    my @array = $string =~ /$assignment (?= \s+ (?: $ | $assignment))/gx;

    for ( my $i = 0; $i < scalar( @array ); $i++ )
    {
      print $i.": ".$array[$i]."\n";
    }

Output:

    0: var1=100
    1: var2=90
    2: var5=hello
    3: var3="a, b, c"
    4: var7=test
    5: var3=hello

+7

Greg Bacon Aug 12 '10 at 13:01

I am not saying this is what you should do, but what you are trying to do is write a grammar. Your example is very simple for a grammar, but Damian Conway's module Regexp::Grammars is really great at this. If you need to grow this at all, you will find that it makes your life a lot easier. I use it quite a bit; it is kind of perl6-ish.

    use Regexp::Grammars;
    use Data::Dumper;
    use strict;
    use warnings;

    my $parser = qr{
        <[pair]>+

        <rule: pair>
            <key>=(?:"<list>"|<value=literal>)

        <token: key>
            var\d+

        <rule: list>
            <[MATCH=literal]> ** (,)

        <token: literal>
            \S+

    }xms;

    q[var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello] =~ $parser;
    die Dumper {%/};

Output:

    $VAR1 = {
      '' => 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello',
      'pair' => [
        {
          '' => 'var1=100',
          'value' => '100',
          'key' => 'var1'
        },
        {
          '' => 'var2=90',
          'value' => '90',
          'key' => 'var2'
        },
        {
          '' => 'var5=hello',
          'value' => 'hello',
          'key' => 'var5'
        },
        {
          '' => 'var3="a, b, c"',
          'key' => 'var3',
          'list' => [
            'a',
            'b',
            'c'
          ]
        },
        {
          '' => 'var7=test',
          'value' => 'test',
          'key' => 'var7'
        },
        {
          '' => 'var3=hello',
          'value' => 'hello',
          'key' => 'var3'
        }
      ]
    };

+6

Evan Carroll Aug 11 '10 at 21:52

A bit over the top maybe, but it gave me an excuse to look into http://p3rl.org/Parse::RecDescent . How about making a parser?

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Parse::RecDescent;

    use Regexp::Common;

    my $grammar = <<'_EOGRAMMAR_'
    INTEGER: /[-+]?\d+/
    STRING: /\S+/
    QSTRING: /$Regexp::Common::RE{quoted}/

    VARIABLE: /var\d+/
    VALUE: ( QSTRING | STRING | INTEGER )

    assignment: VARIABLE "=" VALUE /[\s]*/ { print "$item{VARIABLE} => $item{VALUE}\n"; }

    startrule: assignment(s)
    _EOGRAMMAR_
    ;

    $Parse::RecDescent::skip = '';
    my $parser = Parse::RecDescent->new($grammar);

    my $code = q{var1=100 var2=90 var5=hello var3="a, b, c" var7=test var8=" haha \" heh " var3=hello};
    $parser->startrule($code);

gives:

    var1 => 100
    var2 => 90
    var5 => hello
    var3 => "a, b, c"
    var7 => test
    var8 => " haha \" heh "
    var3 => hello

PS. Note that var3 appears twice. If you want the later assignment to overwrite the earlier one, you can store the values in a hash and use them afterwards.

PPS. My first thought was to split on '=', but that would fail if a value contained '='. Since regular expressions are almost always a poor fit for parsing, I ended up trying a parser instead, and it works.

Edit: Added support for escaped quotes inside quoted strings.
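The PS above suggests a hash when a later assignment should replace an earlier one. A minimal sketch of that idea follows, assuming a plain regex loop instead of the Parse::RecDescent grammar; the %value_of hash is a hypothetical name, and the simple quoted-value pattern does not handle escaped quotes.

```perl
use strict;
use warnings;

my $string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello';

# Hypothetical storage: variable name => most recent value.
my %value_of;

# Scalar-context /g walks through the string one assignment at a time.
while ( $string =~ /(\w+)=("[^"]*"|\S*)/g ) {
    $value_of{$1} = $2;    # a later assignment replaces an earlier one
}

print "$_ => $value_of{$_}\n" for sort keys %value_of;
```

With the sample string, var3 ends up holding hello, because the second var3 assignment is processed last.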

+4

nicomen Aug 11 '10 at 16:47

I recently had to parse x509 "Subject" lines. They had a similar form to the one you provided:

    echo 'Subject: C=HU, L=Budapest, O=Microsec Ltd., CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu' | \
      perl -wne 'my @a = m/(\w+\=.+?)(?=(?:, \w+\=|$))/g; print "$_\n" foreach @a;'

    C=HU
    L=Budapest
    O=Microsec Ltd.
    CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu

A brief description of the regular expression:

(\w+\=.+?) - captures a word followed by '=' and any subsequent characters, in non-greedy mode
(?=(?:, \w+\=|$)) - which must be followed by either another ", KEY=" or the end of the line

The interesting parts of the regex are:

.+? - non-greedy mode
(?:pattern) - non-capturing group
(?=pattern) - zero-width positive look-ahead assertion
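To illustrate why the non-greedy quantifier matters here, the sketch below runs the same pattern with greedy .+ and with non-greedy .+? on a shortened subject string. The shortened input and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $subject = 'C=HU, L=Budapest, O=Microsec Ltd.';    # shortened sample input

# Greedy .+ runs to the end of the string, where the look-ahead's $ succeeds.
my @greedy = $subject =~ /(\w+=.+)(?=(?:, \w+=|$))/g;

# Non-greedy .+? stops at the first position where the look-ahead succeeds.
my @lazy = $subject =~ /(\w+=.+?)(?=(?:, \w+=|$))/g;

print scalar(@greedy), " greedy match(es)\n";
print "$_\n" for @lazy;
```

The greedy version returns the whole string as a single element, while the non-greedy version returns one element per KEY=value pair.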

+3

Delian Krustev Feb 23 '12 at 10:14

This one also handles ordinary backslash escaping inside double quotes, for example var3="a, \"b, c".

 @a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g;

In action:

    echo 'var1=100 var2=90 var42="foo\"bar\\" var5=hello var3="a, b, c" var7=test var3=hello' |
      perl -nle '@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g; $,=","; print @a'

    var1=100,var2=90,var42="foo\"bar\\",var5=hello,var3="a, b, c",var7=test,var3=hello

+2

Hynek -Pichi- Vychodil Aug 11 '10 at 17:18

    #!/usr/bin/perl

    use strict; use warnings;

    use Text::ParseWords;
    use YAML;

    my $string =
        "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

    my @parts = shellwords $string;
    print Dump \@parts;

    @parts = map { { split /=/ } } @parts;

    print Dump \@parts;

+2

Sinan Ünür Aug 11

You asked for a regex solution or other code. Here is a (mostly) non-regex solution that uses only core modules. The only regular expression is \s+, which defines the delimiter: in this case, one or more whitespace characters.

    use strict; use warnings;
    use Text::ParseWords;
    my $string="var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

    my @array = quotewords('\s+', 0, $string);

    for ( my $i = 0; $i < scalar( @array ); $i++ )
    {
        print $i.": ".$array[$i]."\n";
    }

Output:

    0: var1=100
    1: var2=90
    2: var5=hello
    3: var3=a, b, c
    4: var7=test
    5: var3=hello
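Note that the output above has lost the double quotes around the var3 value. If the quotes should be preserved, as in the output requested in the question, one option is to pass a true value as the second (keep) argument of quotewords. A minimal sketch, with @kept as an illustrative variable name:

```perl
use strict;
use warnings;
use Text::ParseWords;

my $string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello';

# A true second argument keeps quotes and backslashes in the returned tokens.
my @kept = quotewords('\s+', 1, $string);

print "$_: $kept[$_]\n" for 0 .. $#kept;
```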

If you really want a regex solution, Alan Moore's comment linking to his code on IDEone is excellent!

+1

dawg Aug 12

This can be done using regular expressions, but it is fragile.

    my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

    my $regexp = qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/x;
    my @matches = $string =~ /$regexp/g;
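As a small sketch of what "fragile" can mean here: unquoted values are limited to word characters and commas, so a value containing anything else is silently cut short. The second input string below is a made-up example, not from the question.

```perl
use strict;
use warnings;

my $regexp = qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/x;

# Values made only of word characters, commas, or a quoted string work.
my @ok = 'var1=100 var3="a, b, c"' =~ /$regexp/g;

# Unquoted values with other characters (here a dot) are truncated.
my @cut = 'ratio=1.5 host=example.org' =~ /$regexp/g;

print "ok:  ", join(' | ', @ok),  "\n";
print "cut: ", join(' | ', @cut), "\n";
```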

0

szbalint Aug 11 '10 at 15:44

jkramer · Accepted Answer · 2010-08-11 15:37

    my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

    while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
        print "<$1> => <$2>\n";
    }

Prints:

    <var1> => <100>
    <var2> => <90>
    <var5> => <hello>
    <var3> => <"a, b, c">
    <var7> => <test>
    <var3> => <hello>

Explanation:

Last piece first: the g flag at the end means that you can apply the regex to the string multiple times. On each subsequent attempt, matching continues from where the previous match ended in the string.
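A minimal sketch of that behavior, using Perl's built-in pos() to show where the next attempt will start. The short input string and the @offsets array are illustrative only.

```perl
use strict;
use warnings;

my $string = 'a=1 b=2';    # illustrative input
my @offsets;

while ( $string =~ /(\w)=(\d)/g ) {
    # pos() reports where the next /g match attempt on $string will start.
    push @offsets, pos($string);
    print "matched $1=$2, next search starts at offset $offsets[-1]\n";
}
```

Once the pattern fails to match, the loop ends and the position is reset, so a fresh loop over the same string starts again from the beginning.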

Now for the regex: (?:^|\s+) matches either the beginning of the string or a run of one or more whitespace characters. This is needed so that, when the regex is applied the next time, we skip the spaces between the key/value pairs. The ?: means that the content of the parentheses is not captured as a group (we do not need the spaces, only the key and the value). \S+ matches the variable name. Then we skip any amount of whitespace, with an equal sign in between. Finally, ("[^"]*"|\S*) matches the value: either two quotes with any number of characters between them, or any number of non-space characters. Note that the quote matching is rather fragile and will not handle escaped quotes properly; for example, "\"quoted\"" would yield "\".
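The escaped-quote limitation can be seen in a short sketch. The second pattern is one possible variant, assuming that a backslash escapes the character that follows it; the input string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $string = 'var8="say \"hi\" now" var9=ok';    # illustrative input

# Simple quoted-value pattern: stops at the first double quote it sees.
my ($simple) = $string =~ /\w+=("[^"]*"|\S*)/;

# Variant (an assumption, not part of the answer): treat a backslash
# plus the following character as one unit inside the quotes.
my ($escaped) = $string =~ /\w+=("(?:[^"\\]|\\.)*"|\S*)/;

print "simple:  $simple\n";
print "escaped: $escaped\n";
```

The simple pattern ends the value at the first escaped quote, while the variant keeps going until the closing unescaped quote.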

EDIT:

Since you really want the whole assignment rather than the individual keys and values, here is a one-liner that extracts them:

 my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;
