What is the fastest way to count the number of words per line in Perl?

I have several functions that I call more than a million times on different texts, so small improvements in these functions add up to big wins overall. I have noticed that all of my functions that involve word counting are significantly slower than everything else, so I want to try counting words a different way.

Basically, my function takes several objects that have text associated with them, checks that the text does not match certain patterns, and then counts the number of words in that text. A basic version of the function:

    my $num_words = 0;
    for (my $i = $begin_pos; $i <= $end_pos; $i++) {
        my $text = $self->_getTextFromNode($i);

        # If it looks like a node full of bogus text, or just a number, remove it.
        if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) {
            return 0;
        }
        if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) {
            return 0;
        }
        if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) {
            return 0;
        }

        my @text_words = split(/\s+/, $text);
        $num_words += scalar(@text_words);
        if ($num_words > 30) {
            return 30;
        }
    }
    return $num_words;

I do a lot of text comparisons like these elsewhere in my code, so I assume the problem must be the word counting. Is there a faster way to do this than splitting on \s+ ? If so, what is it and why is it faster (so that I can understand what I'm doing wrong and apply that knowledge to similar problems later)?

+6
5 answers

Using a while loop with a regex is the fastest way I've found to count words:

    use strict;
    use warnings;
    use feature 'say';
    use Benchmark 'cmpthese';

    my $text = 'asdf asdf asdf asdf asdf';

    # Build the full list of words, then count it.
    sub count_array {
        my @text_words = split(/\s+/, $text);
        scalar(@text_words);
    }

    # Count the matches by assigning them to an empty list in scalar context.
    sub count_list {
        my $x = () = $text =~ /\S+/g;
    }

    # Step through the matches one at a time without building a list.
    sub count_while {
        my $num;
        $num++ while $text =~ /\S+/g;
        $num
    }

    say count_array;    # 5
    say count_list;     # 5
    say count_while;    # 5

    cmpthese -2 => {
        array => \&count_array,
        list  => \&count_list,
        while => \&count_while,
    };

    #           Rate  list array while
    # list  303674/s    --  -22%  -55%
    # array 389212/s   28%    --  -42%
    # while 675295/s  122%   74%    --

The while loop is faster because no memory has to be allocated for each word found. Also, the regex is evaluated in boolean (scalar) context, which means it does not need to extract the actual matches from the string.

+12

If words are always separated by exactly one space, counting the spaces is fast.

    sub count1 {
        my $str = shift;
        return 1 + ($str =~ tr{ }{ });
    }

Updated benchmark:

    use strict;
    use warnings;
    use feature 'say';
    use Benchmark 'cmpthese';

    my $text = 'asdf asdf asdf asdf asdf';

    sub count_array {
        my @text_words = split(/\s+/, $text);
        scalar(@text_words);
    }

    sub count_list {
        my $x = () = $text =~ /\S+/g;
    }

    sub count_while {
        my $num;
        $num++ while $text =~ /\S+/g;
        $num
    }

    # Count the spaces and add one; tr does not modify the string here.
    sub count_tr {
        1 + ($text =~ tr{ }{ });
    }

    say count_array;    # 5
    say count_list;     # 5
    say count_while;    # 5
    say count_tr;       # 5

    cmpthese -2 => {
        array => \&count_array,
        list  => \&count_list,
        while => \&count_while,
        tr    => \&count_tr,
    };

    #            Rate   list  while  array     tr
    # list   220911/s     --   -24%   -44%   -94%
    # while  291225/s    32%     --   -26%   -92%
    # array  391769/s    77%    35%     --   -89%
    # tr    3720197/s  1584%  1177%   850%     --

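The single-space condition matters. The following is a minimal sketch, with made-up sample strings and hypothetical helper names (`count_by_tr`, `count_by_match`), of how the space-counting approach drifts from the regex approach as soon as a line contains a run of spaces:

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; they take the text as an argument.
sub count_by_tr {
    my ($str) = @_;
    return 1 + ($str =~ tr/ //);    # number of spaces, plus one
}

sub count_by_match {
    my ($str) = @_;
    my $n = 0;
    $n++ while $str =~ /\S+/g;      # one iteration per run of non-whitespace
    return $n;
}

my $single = 'one two three';       # exactly one space between words
my $double = 'one  two three';      # two spaces after "one"

printf "single: tr=%d match=%d\n", count_by_tr($single), count_by_match($single);
printf "double: tr=%d match=%d\n", count_by_tr($double), count_by_match($double);
```

On the first string both helpers report 3; on the second, the space count reports 4 while the regex still reports 3. So the tr speedup is only safe if the input is known to be normalized.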
+4

Since you cap the number of words at 30, you can return from the function early:

    while ($text =~ /\S+/g) {
        ++$num_words == 30 && return $num_words;
    }
    return $num_words;

Or, using split:

    $num_words = () = split /\s+/, $text, 30;

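As a self-contained illustration of the early-return idea, here is a sketch that wraps the while loop in a sub. The name `count_words_capped` and the `$cap` parameter are assumptions made for this example, not part of the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: count words in $text, but stop scanning at $cap.
sub count_words_capped {
    my ($text, $cap) = @_;
    my $n = 0;
    while ($text =~ /\S+/g) {
        # Stop as soon as the cap is reached instead of scanning the rest.
        return $cap if ++$n >= $cap;
    }
    return $n;
}

my $short = 'one two three';
my $long  = join ' ', ('word') x 40;    # 40 words, more than the cap

print count_words_capped($short, 30), "\n";
print count_words_capped($long,  30), "\n";
```

The short text yields 3 and the long text yields 30. The saving comes from not scanning the remainder of a long text once the cap has been hit.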
+2

For the sake of correctness, and building on aleroot's answer, you probably want split " " rather than the original split /\s+/ to avoid a fencepost error: as the split documentation notes, a split on /\s+/ is like a split(' ') except that any leading whitespace produces a null first field. That difference gives you one extra "word" (the empty first field) for every line that starts with whitespace.
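A small sketch of the difference, using a made-up sample line that begins with whitespace:

```perl
use strict;
use warnings;

# Hypothetical sample line with two leading spaces.
my $line = "  foo bar baz";

# The pattern matches the leading whitespace, so an empty field comes first.
my @by_regex = split /\s+/, $line;

# The special ' ' form skips leading whitespace before splitting.
my @by_space = split ' ', $line;

printf "regex: %d fields, space: %d fields\n",
    scalar @by_regex, scalar @by_space;
# prints "regex: 4 fields, space: 3 fields"
```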

For speed, since you cap the number of words at 30, you probably want to use the LIMIT argument: split " ", $str, 30 .

On the other hand, the other answers reasonably steer you away from split altogether, since you do not need a list of words, just their count.

+2

Since you only need the number of words rather than an array of words, it would be better to avoid split . Something like this might work:

 $num_words += $text =~ s/((^|\s)\S)/$1/g; 

It replaces the work of creating an array of words with the work of substituting the start of each word with itself; s///g returns the number of substitutions made. You would need to benchmark it to see whether it is actually faster.

EDIT: this may be faster:

 ++$num_words while $text =~ /\S+/g; 
+1

Source: https://habr.com/ru/post/888577/