What is the most readable regular expression for extracting a second word without any trailing spaces from a comma-separated string?

I have an array of strings:

    @source = (
        "something,something2,third"
       ,"something,something3 ,third"
       ,"something,something4"
       ,"something,something 5"   # Note the space in the middle of the word
    );

I need a regular expression that will extract the second of the words separated by commas, BUT without trailing spaces, putting these words in an array.

 @expected_result = ("something2","something3","something4","something 5"); 

What is the most readable way to achieve this?

I have 3 possibilities, none of which seems optimally readable:

  • Pure regex, then grab $1:

     @result = map { (/[^,]+,([^,]*[^, ]) *(,|$)/ )[0] } @source; 
  • Split on the comma (this is NOT CSV, so no real parsing is required), then trim:

     @result = map { my @s = split(","); $s[1] =~ s/ *$//; $s[1] } @source; 
  • Put the split and the trim in nested maps:

     @result = map { s/ *$//; $_ } map { (split(","))[1] } @source; 

Which one is best? Is there another, more readable alternative that I have not thought of?

+4
7 answers

Use named capture groups and named subpatterns with (DEFINE) to greatly improve readability.

    #! /usr/bin/env perl

    use strict;
    use warnings;

    use 5.10.0;  # for named capture buffer and (?&...)

    my $second_trimmed_field_pattern = qr/
      (?&FIRST_FIELD)  (?&SEP)  (?<f2> (?&SECOND_FIELD))

      (?(DEFINE)
        # The separator is a comma preceded by optional whitespace.
        # NOTE: the format is simple comma separators, NOT full CSV, so
        # we don't have to worry about processing escapes or quoted
        # fields.
        (?<SEP> \s* ,)

        # A field stops matching as soon as it sees a separator
        # or end-of-string, so it matches in similar fashion to
        # a pattern with a non-greedy quantifier.
        (?<FIELD> (?: (?! (?&SEP) | $) .)+ )

        # The first field is anchored at start-of-string.
        (?<FIRST_FIELD> ^ (?&FIELD))

        # The second field looks like any other field. The name
        # captures our intent for its use in the main pattern.
        (?<SECOND_FIELD> (?&FIELD))
      )
    /x;

In action:

    my @source = (
      "something,something2,third"
      ,"something,something3 ,third"
      ,"something,something4"
      ,"something,something 5"  # Note the space in the middle of the word
    );

    for (@source) {
      if (/$second_trimmed_field_pattern/) {
        print "[$+{f2}]\n";
        #print "[$1]\n";  # or do it the old-fashioned way
      }
      else {
        chomp;
        print "no match for [$_]\n";
      }
    }

Output:

  [something2]
 [something3]
 [something4]
 [something 5] 

You can express the same thing for older perls. Below I confine the fragments to the lexical scope of a sub to show that they all work together as a unit.

    sub make_second_trimmed_field_pattern {
      my $sep = qr/
        # The separator is a comma preceded by optional whitespace.
        # NOTE: the format is simple comma separators, NOT full CSV, so
        # we don't have to worry about processing escapes or quoted
        # fields.
        \s* ,
      /x;

      my $field = qr/
        # A field stops matching as soon as it sees a separator
        # or end-of-string, so it matches in similar fashion to
        # a pattern with a non-greedy quantifier.
        (?:
          # the next character to be matched is not the
          # beginning of a separator sequence or
          # end-of-string
          (?! $sep | $ )

          # ... so consume it
          .
        )+  # ... as many times as possible
      /x;

      qr/ ^ $field $sep ($field) /x;
    }

Use it as in

    my @source = ...;  # same as above
    my $second_trimmed_field_pattern = make_second_trimmed_field_pattern;

    for (@source) {
      if (/$second_trimmed_field_pattern/) {
        print "[$1]\n";
      }
      else {
        chomp;
        print "no match for [$_]\n";
      }
    }

Output:

  $ perl5.8.8 prog
 [something2]
 [something3]
 [something4]
 [something 5] 
+6

Of those possibilities, I think #2 is the clearest, though I would tweak it slightly to fold the spaces into the split:

 @result = map { my @s = split(/ *(?:,|$)/); $s[1] } @source; 

(In this case, I might write /[ ]*(?:,|$)/, using a no-op character class so that it is a bit more obvious what the * quantifies.)

Edited to add: Oops, I previously had a stupid mistake whereby it failed to remove a trailing space from something like "foo, bar ". Now that I have fixed that bug, the result is not quite so nice and simple, and I'm not sure I would still recommend it over the versions above!
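As a rough sketch of how the adjusted split behaves, the snippet below wraps it in a small helper and runs it over the question's sample data plus one extra string with a trailing space. The helper name second_field and the fifth test string are made up for illustration; they are not part of the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the second comma-separated field of a string,
# with any spaces directly before a separator or end-of-string removed.
sub second_field {
    my ($line) = @_;
    my @fields = split /[ ]*(?:,|$)/, $line;
    return $fields[1];
}

my @source = (
    "something,something2,third",
    "something,something3 ,third",
    "something,something4",
    "something,something 5",
    "something,something 6 ",    # assumed extra case: trailing space at end
);

my @result = map { second_field($_) } @source;
print "[$_]\n" for @result;
```

Note that only spaces before a separator are dropped; a space right after a comma would stay part of the following field.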

+6

I would do:

 my @result = map /,(.*?[^\s,])\s*(?:,|\z)/, @source; 
+1

Regular expressions are rarely readable in the usual sense. They are more like a complex mathematical formula. If readability is a concern, consider using comments (Perl regexes support inline comments under the /x modifier).

This page gives a good overview: http://www.perl.com/pub/2004/01/16/regexps.html

Example from this page:

    $_ =~ m/^                         # anchor at beginning of line
            The\ quick\ (\w+)\ fox    # fox adjective
            \ (\w+)\ over             # fox action verb
            \ the\ (\w+) dog          # dog adjective
            (?:                       # whitespace-trimmed comment:
              \s* \# \s*              #   whitespace and comment token
              (.*?)                   #   captured comment text; non-greedy!
              \s*                     #   any trailing whitespace
            )?                        # this is all optional
            $                         # end of line anchor
           /x;                        # allow whitespace
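Applied to the first pattern from the question, the same commenting style might look like the sketch below. The variable name $second_word is an assumption for illustration, and the pattern is only a re-spelling of the asker's option 1 under /x, not a new technique. Keep in mind that /x does not ignore whitespace inside a bracketed character class, so the literal spaces are written inside brackets.

```perl
use strict;
use warnings;

# Hypothetical name; the pattern is option 1 from the question, spread out.
my $second_word = qr/
    [^,]+ ,          # first field and its comma
    (                # capture the second field:
      [^,]*          #   any run of non-comma characters ...
      [^, ]          #   ... that ends in something other than comma or space
    )
    [ ]*             # trailing spaces, kept out of the capture
    (?: , | $ )      # next separator or end of string
/x;

my @source = (
    "something,something2,third",
    "something,something3 ,third",
    "something,something4",
    "something,something 5",
);

# Keep $1 for every string that matches; skip strings that do not.
my @result = map { /$second_word/ ? $1 : () } @source;
print "[$_]\n" for @result;
```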

Come to think of it, if readability is a concern, why the hell are you using Perl? ;)

0
 @result = map { /,([^,]*?)\s*(?:,|$)/ } @source; 
0

It is best to let split handle this, since it can also remove any whitespace preceding the comma. Just split on /\s*(?:,|$)/, take the second element of the list, and all the hard work is done. The full code is as follows:

    use strict;
    use warnings;
    use feature 'say';

    my @source = (
      "something,something2,third",
      "something,something3 ,third",
      "something,something4",
      "something,something 5 ",
    );

    my @result = map { (split /\s*(?:,|$)/)[1] } @source;

    say "|$_|" for @result;

OUTPUT

    |something2|
    |something3|
    |something4|
    |something 5|
0

I like your option 3 best. It clearly separates the steps you take to “select” the right data from the additional manipulation you then perform on it.

So, if readability is the criterion: option 3 is the clear winner.
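To make the two steps stand out even more, option 3 could be written with one named intermediate array per step, as in the sketch below. It assumes Perl 5.14 or later for the non-destructive /r modifier, and the array name @second_fields is made up for illustration.

```perl
use strict;
use warnings;
use 5.014;    # assumed minimum version, needed for s///r

my @source = (
    "something,something2,third",
    "something,something3 ,third",
    "something,something4",
    "something,something 5",
);

# Step 1: select the second comma-separated field of each string.
my @second_fields = map { (split /,/)[1] } @source;

# Step 2: strip trailing spaces; with /r the substitution returns the
# modified copy and leaves the original element untouched.
my @result = map { s/ +$//r } @second_fields;

print "[$_]\n" for @result;
```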

0

Source: https://habr.com/ru/post/1403470/

