Regular expression: repeating capture groups

I need to parse some tables from an ASCII text file. Here's a partial selection:

    QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11   11000   1.212
    RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34    2850   5.707
    RUPALIINS   150.00  159.00  150.00  156.25    6.29       4      80    .125
    SALAMCRST   164.00  164.75  163.00  163.25    -.45      80    8250  13.505
    SINGERBD    779.75  779.75  770.00  773.00    -.89       8      95    .735
    SONARBAINS   68.00   69.00   67.50   68.00     .74      11    3050   2.077

The table consists of 1 column of text and 8 columns of floating point numbers. I would like to capture each column using regular expressions.

I am new to regular expressions. Here is the flawed pattern I came up with:

 (\S+)\s+(\s+[\d\.\-]+){8} 

But the pattern only captures the first and last columns. RegexBuddy also issues the following warning:

You repeated the capture group yourself. The group will only capture the last iteration. Place the capture group around the repeating group to capture all iterations.

I consulted their help file, but I do not know how to solve this.

How can I capture each column separately?

+9
c# regex
Jul 03 '10 at 19:35
3 answers

In C# (modified from this example):

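    using System;
    using System.Text.RegularExpressions;

    // Columns are separated by runs of spaces; this pattern needs at least
    // two whitespace characters between the name and the first number.
    string input = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11   11000   1.212";
    string pattern = @"^(\S+)\s+(\s+[\d.-]+){8}$";
    Match match = Regex.Match(input, pattern, RegexOptions.Multiline);
    if (match.Success)
    {
        Console.WriteLine("Matched text: {0}", match.Value);
        for (int ctr = 1; ctr < match.Groups.Count; ctr++)
        {
            // Group.Value holds only the last iteration of a repeated group.
            Console.WriteLine("   Group {0}: {1}", ctr, match.Groups[ctr].Value);
            int captureCtr = 0;
            // Group.Captures keeps every iteration of the group.
            foreach (Capture capture in match.Groups[ctr].Captures)
            {
                Console.WriteLine("      Capture {0}: {1}", captureCtr, capture.Value);
                captureCtr++;
            }
        }
    }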

Output:

    Matched text: QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212
    ...
       Group 2: 1.212
          Capture 0: 11.00
          Capture 1: 11.10
          Capture 2: 11.00
    ...etc.
+12
Jul 03 '10 at 19:58

Unfortunately, you need to repeat the (…) group 8 times to capture each column separately:

 ^(\S+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)$ 
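
Typing the group eight times by hand is error-prone, so the same pattern can be assembled in code. The following is a minimal Python sketch, not part of the original answer; the names `NUM`, `pattern`, `text` and `rows` are illustrative, and the two sample rows are taken from the question.

```python
import re

# One numeric column, same character class as in the pattern above.
NUM = r"([-.\d]+)"

# Name column followed by eight explicitly written numeric groups.
pattern = re.compile(r"^(\S+)\s+" + r"\s+".join([NUM] * 8) + r"$", re.M)

text = "\n".join([
    "QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212",
    "RECKITTBEN 192.50 209.00 192.50 201.80 5.21 34 2850 5.707",
])

# With several groups, findall returns one tuple of 9 strings per row.
rows = pattern.findall(text)
for row in rows:
    print(row)
```

Because every column has its own group, each tuple element maps directly to one column without any further splitting.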

If you can post-process the match in code, you can first match the numeric columns as a whole:

    >>> import re
    >>> rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
    >>> allres = rx1.findall(theAsciiText)

and then split the columns on whitespace:

 >>> [[p] + q.split() for p, q in allres] 
+5
Jul 03 '10 at 19:38

As for the warning: your capture group matches multiple times (8, as you specified), but a capture variable can hold only one value, so it is assigned the last one.
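
This behaviour is easy to reproduce. The sketch below uses Python's `re` module purely for illustration (the question itself is about C#) and applies the pattern from the question to one row; the two-space separators in `row` are an assumption, chosen because the pattern requires at least two whitespace characters after the name.

```python
import re

# Sample row from the question; two-space separators are assumed here.
row = "QSMDRYCELL  11.00  11.10  11.00  11.00  -.90  11  11000  1.212"

# The repeated group (group 2) is overwritten on each of its 8 iterations.
m = re.match(r"(\S+)\s+(\s+[\d\.\-]+){8}", row)

first_column = m.group(1)
last_iteration = m.group(2)

print(first_column)            # the name column
print(last_iteration.strip())  # only the final numeric column survives
```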

As described in question 1313332, retrieving all of these matches is generally not possible with a regex, although .NET and Perl 6 have some support for it.

The warning suggests putting another group around the whole repeated set, for example:

 (\S+)\s+((\s+[\d\.\-]+){8}) 

Then you can see all the columns, but of course they will not be separated. Since they cannot be captured separately in general anyway, the more common intent is to capture the whole run, and that is what the warning helps you do.
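
As a rough illustration in Python (not the asker's language, and with an assumed two-space separator in `row`), the wrapped pattern yields the whole numeric run in group 2, which can then be split in code. The variable names are hypothetical.

```python
import re

# Sample row from the question; two-space separators are assumed here.
row = "QSMDRYCELL  11.00  11.10  11.00  11.00  -.90  11  11000  1.212"

# Group 2 wraps the repetition; group 3 is the repeated inner group.
m = re.match(r"(\S+)\s+((\s+[\d\.\-]+){8})", row)

name = m.group(1)
whole_run = m.group(2)        # all eight numeric columns, unseparated
last_only = m.group(3)        # inner group still keeps only the last iteration
numbers = whole_run.split()   # separate the columns outside the regex

print(name, numbers)
```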

+4
Jan 02 '10 at 9:12


