There is a script here:
use strict;
use warnings;
use feature 'say';
use DDP;
use String::Random;
use List::Util qw/shuffle uniq/;
use Time::HiRes qw/gettimeofday tv_interval/;
my $string_gen = String::Random->new;
my @all_words = map { $string_gen->randpattern('c' x (3 .. 8)[int rand 6] ) } 1 .. 100_000;
for (my $keywords_length = 1; $keywords_length <= 20001; $keywords_length += 1000) {
my $story = join ' ', (shuffle(@all_words))[1 .. 5000];
my @unique_keywords_sublist = uniq((shuffle(@all_words))[1 .. $keywords_length]);
my %rep = map { $_ => '_keyword_' } @unique_keywords_sublist;
my $compiled_re = qr/(@{[join '|', keys %rep]})/;
my $t_start = [gettimeofday];
$story =~ s/$compiled_re/$rep{$1}/g;
my $t_end = [gettimeofday];
say sprintf("%6d | %.5f |", $keywords_length, tv_interval($t_start, $t_end));
}
Conclusion:
$ perl main.pl
1 | 0.00003 |
1001 | 0.00132 |
2001 | 0.00183 |
3001 | 0.00239 |
4001 | 0.00306 |
5001 | 0.00360 |
6001 | 0.00402 |
7001 | 0.00426 |
8001 | 0.00481 |
9001 | 0.00484 |
10001 | 0.00489 |
11001 | 0.00514 |
12001 | 0.00525 |
13001 | 0.00545 |
14001 | 0.00604 |
15001 | 0.00607 |
16001 | 0.00658 |
17001 | 0.00653 |
18001 | 0.00679 |
19001 | 9.38560 |
20001 | 9.70753 |
Why since 19001 alternatives in s /// g has the replacement been taking so long (9.3, 9.7 sec)?
For example, in Python, the same example is faster:
Example: https://gist.github.com/vi3k6i5/dc3335ee46ab9f650b19885e8ade6c7a
Output:
Count | FlashText | Regex
-------------------------------
('1 ', '|', '0.01189 ', '|', '0.00005 ', '|')
('1001 ', '|', '0.01361 ', '|', '0.07609 ', '|')
('2001 ', '|', '0.01391 ', '|', '0.19645 ', '|')
('3001 ', '|', '0.01419 ', '|', '0.28503 ', '|')
('4001 ', '|', '0.01393 ', '|', '0.37710 ', '|')
('5001 ', '|', '0.01401 ', '|', '0.45982 ', '|')
('6001 ', '|', '0.01425 ', '|', '0.54455 ', '|')
('7001 ', '|', '0.01456 ', '|', '0.61070 ', '|')
('8001 ', '|', '0.01449 ', '|', '0.68954 ', '|')
('9001 ', '|', '0.01465 ', '|', '0.76342 ', '|')
('10001 ', '|', '0.01487 ', '|', '1.05101 ', '|')
('11001 ', '|', '0.01504 ', '|', '0.88044 ', '|')
('12001 ', '|', '0.01505 ', '|', '0.95745 ', '|')
('13001 ', '|', '0.01507 ', '|', '1.08424 ', '|')
('14001 ', '|', '0.01627 ', '|', '1.38801 ', '|')
('15001 ', '|', '0.01532 ', '|', '1.20770 ', '|')
('16001 ', '|', '0.01559 ', '|', '1.44000 ', '|')
('17001 ', '|', '0.01558 ', '|', '1.92552 ', '|')
('18001 ', '|', '0.01614 ', '|', '2.65560 ', '|')
('19001 ', '|', '0.01600 ', '|', '2.59729 ', '|')
('20001 ', '|', '0.01595 ', '|', '2.90371 ', '|')
This example is taken from the article https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f