What is the most readable regular expression for extracting a second word without any trailing spaces from a comma-separated string?

Question

I have an array of strings:

    @source = (
        "something,something2,third"
       ,"something,something3 ,third"
       ,"something,something4"
       ,"something,something 5" # Note the space in the middle of the word
    );

I need a regular expression that will extract the second of the words separated by commas, BUT without trailing spaces, putting these words in an array.

 @expected_result = ("something2","something3","something4","something 5");

What is the most readable way to achieve this?

I have 3 possibilities, none of which seems optimally readable:

Pure regex, then grab $1:

 @result = map { (/[^,]+,([^,]*[^, ]) *(,|$)/ )[0] } @source;

Split on commas (this is NOT CSV, so no real parsing is required), then trim:

 @result = map { my @s = split(","); $s[1] =~ s/ *$//; $s[1] } @source;

Put the split and the trim in nested maps:

 @result = map { s/ *$//; $_ } map { (split(","))[1] } @source;

Which one is best? Is there another, more readable alternative that I haven't thought of?

+4

regex perl readability

DVK Mar 26 '12 at 13:47

7 answers

Of these possibilities, I think #2 is the clearest, though I would tweak it slightly to fold the spaces into the split pattern:

 @result = map { my @s = split(/ *(?:,|$)/); $s[1] } @source;

(In real code I might write /[ ]*(?:,|$)/, with the no-op character class, so that it's a bit more visible that the space is what * quantifies.)

Edited to add: Oops, I previously had a silly bug where the trailing space was not removed from something like "foo, bar ". Now that I have fixed that bug, the result is not quite as nice and simple, and I'm not sure I'd still recommend it over the alternatives above!
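To make the behaviour of that split pattern concrete, here is a minimal sketch that runs it over the question's data. The trailing space added to the last string is an assumption made for illustration (it mirrors the "foo, bar " case); it is not part of the original input.

```perl
use strict;
use warnings;

my @source = (
    "something,something2,third",
    "something,something3 ,third",
    "something,something4",
    "something,something 5 ",   # assumed variant: trailing space at end of string
);

# The pattern consumes any run of spaces that sits directly before
# a comma or the end of the string, so fields come back right-trimmed.
my @result = map { my @s = split / *(?:,|$)/; $s[1] } @source;

print "[$_]\n" for @result;
```

Note that the pattern only removes spaces before a separator; a space after a comma would remain at the start of the next field.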

+6

ruakh Mar 26 '12 at 13:59

I would do:

 my @result = map /,(.*?[^\s,])\s*(?:,|\z)/, @source;

+1

ysth Mar 27 '12 at 3:22

Regular expressions are usually not readable in the ordinary sense. They are more like a complex mathematical formula. If readability is a concern, consider using comments (Perl regexes support inline comments under the /x modifier).

This page gives a good overview: http://www.perl.com/pub/2004/01/16/regexps.html

Example from this page:

    $_ =~ m/^                         # anchor at beginning of line
            The\ quick\ (\w+)\ fox    # fox adjective
            \ (\w+)\ over             # fox action verb
            \ the\ (\w+) dog          # dog adjective
            (?:                       # whitespace-trimmed comment:
              \s* \# \s*              #   whitespace and comment token
              (.*?)                   #   captured comment text; non-greedy!
              \s*                     #   any trailing whitespace
            )?                        # this is all optional
            $                         # end of line anchor
           /x;                        # allow whitespace

Come to think of it, if readability is a concern, why the hell are you using Perl? ;)
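Applied to the problem in the question, the same commenting technique might look like the sketch below. The pattern and the variable name $second_field are illustrative assumptions, not taken from the linked page.

```perl
use strict;
use warnings;

my @source = (
    "something,something2,third",
    "something,something3 ,third",
    "something,something4",
    "something,something 5",
);

# $second_field is a hypothetical name chosen for this sketch.
my $second_field = qr/
    ^ [^,]+ ,        # first field and the comma that ends it
    ( [^,]*? )       # second field, captured non-greedily
    [ ]*             # trailing spaces, kept out of the capture
    (?: , | $ )      # next comma, or end of string
/x;

# In list context a successful match returns its captures,
# so map collects one second field per input string.
my @result = map { /$second_field/ } @source;

print "[$_]\n" for @result;
```

Under /x a literal space has to be written as [ ] or as an escaped space, which is why the trailing-space step uses a character class.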

0

SpliFF Mar 26 '12 at 13:58

 @result = map { /,([^,]*?)\s*(?:,|$)/ } @source;

0

perreal Mar 26 '12 at 14:11

This is best handled with split, using a pattern that also consumes any whitespace preceding the comma. Just split on /\s*(?:,|$)/, take the second element of the list, and all the hard work is done. The full code looks like this:

    use strict;
    use warnings;
    use feature 'say';

    my @source = (
      "something,something2,third",
      "something,something3 ,third",
      "something,something4",
      "something,something 5 ",
    );

    my @result = map { (split /\s*(?:,|$)/)[1] } @source;

    say "|$_|" for @result;

OUTPUT

    |something2|
    |something3|
    |something4|
    |something 5|

0

Borodin Mar 26 '12 at 15:10

I like your option 3 best. It clearly separates the steps you take to "select" the right data from the additional manipulation you then perform on it.

So, if readability is the criterion: option 3 is the clear winner.
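One way to push that separation of steps further is to give each step a name. This is only a sketch: second_field and rtrim are hypothetical helper names, not something from the question.

```perl
use strict;
use warnings;

my @source = (
    "something,something2,third",
    "something,something3 ,third",
    "something,something4",
    "something,something 5",
);

# Hypothetical helper: select the second comma-separated field.
sub second_field {
    my ($line) = @_;
    my $field = (split /,/, $line)[1];
    return $field;
}

# Hypothetical helper: strip trailing spaces from a copy of the string.
sub rtrim {
    my ($s) = @_;
    $s =~ s/ +$//;
    return $s;
}

my @result = map { rtrim(second_field($_)) } @source;

print "[$_]\n" for @result;
```

The sketch assumes every input string contains at least one comma; otherwise second_field returns undef and rtrim warns.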

0

Haf linger Mar 26 '12 at 18:09

Greg bacon · Accepted Answer · 2012-03-26T19:32:54+0000

Use named capture groups and named subpatterns with (DEFINE) to greatly improve readability.

    #! /usr/bin/env perl

    use strict;
    use warnings;

    use 5.10.0;  # for named capture buffer and (?&...)

    my $second_trimmed_field_pattern = qr/
      (?&FIRST_FIELD)
      (?&SEP)
      (?<f2> (?&SECOND_FIELD))

      (?(DEFINE)
        # The separator is a comma preceded by optional whitespace.
        # NOTE: the format is simple comma separators, NOT full CSV, so
        # we don't have to worry about processing escapes or quoted
        # fields.
        (?<SEP> \s* ,)

        # A field stops matching as soon as it sees a separator
        # or end-of-string, so it matches in similar fashion to
        # a pattern with a non-greedy quantifier.
        (?<FIELD> (?: (?! (?&SEP) | $) .)+ )

        # The first field is anchored at start-of-string.
        (?<FIRST_FIELD> ^ (?&FIELD))

        # The second field looks like any other field. The name
        # captures our intent for its use in the main pattern.
        (?<SECOND_FIELD> (?&FIELD))
      )
    /x;

In action:

    my @source = (
        "something,something2,third"
       ,"something,something3 ,third"
       ,"something,something4"
       ,"something,something 5" # Note the space in the middle of the word
    );

    for (@source) {
      if (/$second_trimmed_field_pattern/) {
        print "[$+{f2}]\n";
        #print "[$1]\n";  # or do it the old-fashioned way
      }
      else {
        chomp;
        print "no match for [$_]\n";
      }
    }

Output:

    [something2]
    [something3]
    [something4]
    [something 5]

You can express it similarly for older perls. Below I confine the pieces to the lexical scope of a sub to show that they all work together as a unit.

    sub make_second_trimmed_field_pattern {
      my $sep = qr/
        # The separator is a comma preceded by optional whitespace.
        # NOTE: the format is simple comma separators, NOT full CSV, so
        # we don't have to worry about processing escapes or quoted
        # fields.
        \s* ,
      /x;

      my $field = qr/
        # A field stops matching as soon as it sees a separator
        # or end-of-string, so it matches in similar fashion to
        # a pattern with a non-greedy quantifier.
        (?:
          # the next character to be matched is not the
          # beginning of a separator sequence or
          # end-of-string
          (?! $sep | $ )

          # ... so consume it
          .
        )+  # ... as many times as possible
      /x;

      qr/ ^ $field $sep ($field) /x;
    }

Use it as in

    my @source = ...;  # same as above

    my $second_trimmed_field_pattern = make_second_trimmed_field_pattern;
    for (@source) {
      if (/$second_trimmed_field_pattern/) {
        print "[$1]\n";
      }
      else {
        chomp;
        print "no match for [$_]\n";
      }
    }

Output:

    $ perl5.8.8 prog
    [something2]
    [something3]
    [something4]
    [something 5]
