Perl s /// g with many alternations slow

There is a script here:

#!/usr/bin/env perl -w
use strict;
use warnings;
use feature 'say';

use DDP;
use String::Random;
use List::Util qw/shuffle uniq/;
use Time::HiRes qw/gettimeofday tv_interval/;

# generate a random word of given length
my $string_gen = String::Random->new;
my @all_words = map { $string_gen->randpattern('c' x (3 .. 8)[int rand 6] ) } 1 .. 100_000;

for (my $keywords_length = 1; $keywords_length <= 20001; $keywords_length += 1000) { 
  # chose 5000 terms and create a string to search in.
  my $story = join ' ', (shuffle(@all_words))[1 .. 5000];

  # get unique keywords from the list of words generated.
  my @unique_keywords_sublist = uniq((shuffle(@all_words))[1 .. $keywords_length]);

  my %rep = map { $_ => '_keyword_' } @unique_keywords_sublist;
  my $compiled_re = qr/(@{[join '|', keys %rep]})/;

  my $t_start = [gettimeofday];
  $story =~ s/$compiled_re/$rep{$1}/g;
  my $t_end = [gettimeofday];

  say sprintf("%6d | %.5f |", $keywords_length, tv_interval($t_start, $t_end));
}

Conclusion:

$  perl main.pl
     1 | 0.00003 |
  1001 | 0.00132 |
  2001 | 0.00183 |
  3001 | 0.00239 |
  4001 | 0.00306 |
  5001 | 0.00360 |
  6001 | 0.00402 |
  7001 | 0.00426 |
  8001 | 0.00481 |
  9001 | 0.00484 |
 10001 | 0.00489 |
 11001 | 0.00514 |
 12001 | 0.00525 |
 13001 | 0.00545 |
 14001 | 0.00604 |
 15001 | 0.00607 |
 16001 | 0.00658 |
 17001 | 0.00653 |
 18001 | 0.00679 |
 19001 | 9.38560 |
 20001 | 9.70753 |

Why since 19001 alternatives in s /// g has the replacement been taking so long (9.3, 9.7 sec)?

For example, in Python, the same example is faster:

Example: https://gist.github.com/vi3k6i5/dc3335ee46ab9f650b19885e8ade6c7a Output:

Count  | FlashText | Regex
-------------------------------
('1     ', '|', '0.01189  ', '|', '0.00005  ', '|')
('1001  ', '|', '0.01361  ', '|', '0.07609  ', '|')
('2001  ', '|', '0.01391  ', '|', '0.19645  ', '|')
('3001  ', '|', '0.01419  ', '|', '0.28503  ', '|')
('4001  ', '|', '0.01393  ', '|', '0.37710  ', '|')
('5001  ', '|', '0.01401  ', '|', '0.45982  ', '|')
('6001  ', '|', '0.01425  ', '|', '0.54455  ', '|')
('7001  ', '|', '0.01456  ', '|', '0.61070  ', '|')
('8001  ', '|', '0.01449  ', '|', '0.68954  ', '|')
('9001  ', '|', '0.01465  ', '|', '0.76342  ', '|')
('10001 ', '|', '0.01487  ', '|', '1.05101  ', '|')
('11001 ', '|', '0.01504  ', '|', '0.88044  ', '|')
('12001 ', '|', '0.01505  ', '|', '0.95745  ', '|')
('13001 ', '|', '0.01507  ', '|', '1.08424  ', '|')
('14001 ', '|', '0.01627  ', '|', '1.38801  ', '|')
('15001 ', '|', '0.01532  ', '|', '1.20770  ', '|')
('16001 ', '|', '0.01559  ', '|', '1.44000  ', '|')
('17001 ', '|', '0.01558  ', '|', '1.92552  ', '|')
('18001 ', '|', '0.01614  ', '|', '2.65560  ', '|')
('19001 ', '|', '0.01600  ', '|', '2.59729  ', '|')
('20001 ', '|', '0.01595  ', '|', '2.90371  ', '|')

This example is taken from the article https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f

+4
source share

Source: https://habr.com/ru/post/1690684/


All Articles