Perl regex: How do I search a file for a multi-line pattern without reading the entire file into memory?

My question is the same as "How to make re.search or re.match on the entire file without reading it all in memory?", but using Perl instead of Python.

Question: I want to run a regular expression against an entire file without reading the whole file into memory at once, since I may be working with fairly large files in the future. Is there a way to do this? Thanks!

Explanation: I cannot match line by line because a match can span multiple lines.

Why am I using Perl instead of Python? I ran into problems with Python's regex support that make me want to switch to Perl. I would install https://pypi.python.org/pypi/regex , but I can't, since my workstation, of course, doesn't allow write access to its Python installation directory, and I would rather avoid the slow back-and-forth of emailing IT to install it for me and/or dealing with further permission issues :)

EDIT: examples of the patterns I'm looking for:

    assign signal0 = (cond1) ? val1 : (cond2) ? val2 : val3;

    assign signal1[15:0] = {input1[7:0], input2[7:0]};

    assign signal2[34:0] = { 4'b0,
                             subsig0[3:0],
                             subsig1,
                             subsig2,
                             subsig3[18:2],
                             subsig4[5:0]
                           };

I am looking for patterns like the above, i.e. the assignment of a variable, up to the terminating semicolon. The regular expression has to match any of the above, since I don't know in advance whether a given statement spans multiple lines. Perhaps something similar to /assign\s+\w+\s+=\s+[^;]*;/m , that is, everything up to the next semicolon.

EDIT2: From the answers given (thanks!), it seems that decomposing the pattern into start, middle, and end sections might be the best strategy, for example by using the range operator, as some have suggested.

4 answers

You can use the range operator to match everything between two patterns while reading line by line:

    use strict;
    use warnings 'all';

    while (<DATA>) {
        # True from a line starting with "assign " through the next line
        # containing ";" (which may be the same line).
        print if /^assign / .. /;/;
    }

    __DATA__
    foo
    assign signal0 = (cond1) ? val1 : (cond2) ? val2 : val3;
    bar
    assign signal1[15:0] = {input1[7:0], input2[7:0]};
    baz
    assign signal2[34:0] = { 4'b0,
                             subsig0[3:0],
                             subsig1,
                             subsig2,
                             subsig3[18:2],
                             subsig4[5:0]
                           };
    qux

Output:

    assign signal0 = (cond1) ? val1 : (cond2) ? val2 : val3;
    assign signal1[15:0] = {input1[7:0], input2[7:0]};
    assign signal2[34:0] = { 4'b0,
                             subsig0[3:0],
                             subsig1,
                             subsig2,
                             subsig3[18:2],
                             subsig4[5:0]
                           };

You can set the input record separator $/ to a semicolon ; and read record by record instead of line by line. Each record will then contain one statement, including its terminating semicolon, so matching becomes trivial.
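A minimal sketch of this idea, assuming the input is available on a filehandle. The helper name `find_assignments`, the sample text, and the use of `\S+` for the signal name (so that names such as signal1[15:0] match) are illustrative assumptions, not part of the original answer:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every "assign ... ;" statement read from $fh.
# Only one semicolon-terminated record is held in memory at a time.
sub find_assignments {
    my ($fh) = @_;
    local $/ = ';';    # records now end at ';' instead of at a newline
    my @found;
    while (my $record = <$fh>) {
        # A multi-line statement arrives as a single string, so an
        # ordinary regex can match it in one go.
        push @found, $1 if $record =~ /(\bassign\s+\S+\s*=[^;]*;)/;
    }
    return @found;
}

# Usage with an in-memory filehandle (illustrative data); a real file
# would be opened with open(my $fh, '<', $path) instead.
my $text = "foo\nassign a = b;\nbar\nassign c[1:0] = { x,\n  y };\nqux\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
print "$_\n" for find_assignments($fh);
close $fh;
```

Note that memory use is bounded by the longest stretch of text between two semicolons, so a file containing no semicolons at all would still be read in one piece.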


I can imagine two solutions (without thinking too much, so maybe I'm wrong):

a) Assume a maximum match length, for example 1024 characters. 1) Read twice as many (2048) characters into a buffer. 2) Try to match. 3) Advance by 1024 characters. Repeat.

b) Use start and end patterns that each match within a single line. The part between them can be checked later. Perl's flip-flop (range) operator fits this scenario.
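Option a) could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `scan_in_chunks` is hypothetical, no single match may be longer than `$max` characters, and the pattern must end in an explicit terminator (here the semicolon) so that an incomplete statement at the end of the buffer cannot match prematurely.

```perl
use strict;
use warnings;

# Hypothetical helper: scans $fh in fixed-size chunks and returns all matches
# of $re. The buffer never grows beyond 2 * $max characters.
sub scan_in_chunks {
    my ($fh, $re, $max) = @_;
    my $buf = '';
    my @found;
    while (read($fh, my $chunk, $max)) {
        $buf .= $chunk;
        my $keep_from = 0;
        while ($buf =~ /($re)/g) {
            push @found, $1;
            $keep_from = $+[0];             # end of the last complete match
        }
        substr($buf, 0, $keep_from, '');    # drop text that was already matched
        # An unfinished match must start within the last $max characters,
        # so anything before that can be discarded.
        substr($buf, 0, length($buf) - $max, '') if length($buf) > $max;
    }
    return @found;
}

# Usage with an in-memory filehandle and a deliberately small chunk size.
my $text = "foo\nassign a = b;\nbar\nassign c[1:0] = { x,\n  y };\nqux\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
print "$_\n" for scan_in_chunks($fh, qr/assign\s+\S+\s*=[^;]*;/, 32);
close $fh;
```

A statement longer than `$max` would be silently missed, which is the main weakness of this approach compared with option b).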

Edit: since the question has been updated, solution b) seems good.

The starting pattern will be the assignment, and the ending pattern will be a semicolon. Everything between them can be concatenated, and then checked for validity.

Example:

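    my $assignment = "";
    while (<>) {
        # Inside the range the operator returns a sequence number; on the
        # last line of the range that number has "E0" appended.
        # \S+ (rather than \w+) lets names like signal1[15:0] match.
        if (my $in_range = /assign\s+\S+\s+=/ .. /;/) {
            $assignment .= $_;
            if ($in_range =~ /E0$/) {
                # The statement is complete: validate it as a whole.
                if ($assignment =~ /full regex/) {
                    # do something with the match
                }
                $assignment = "";
            }
        }
    }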

Here is an example of progressive matching using a pre-match pattern:

    use feature qw(say);
    use strict;
    use warnings;

    my $pre_match = qr{assign\s+\S+\s+=\s+};
    # Braces keep [^;] from being read as a subscript of @pre_match.
    my $regex = qr{(${pre_match}[^;]+;)};
    my $line = "";
    my $found_start = 0;
    while ( <DATA> ) {
        if ( !$found_start && /$pre_match/ ) {
            $line = "";
            $found_start = 1;
        }
        if ( $found_start ) {
            $line .= $_;    # accumulate lines until the full pattern matches
            if ( $line =~ /$regex/ ) {
                say "Got match: '$1'";
                $found_start = 0;
                # Re-process whatever follows the match on the same line.
                $_ = substr $line, $+[0];
                redo;
            }
        }
    }

    __DATA__
    assign signal0 = (cond1) ? val1 : (cond2) ? val2 : val3;
    assign signal1[15:0] = {input1[7:0], input2[7:0]};
    assign signal2[34:0] = { 4'b0,
                             subsig0[3:0],
                             subsig1,
                             subsig2,
                             subsig3[18:2],
                             subsig4[5:0]
                           };

Output:

    Got match: 'assign signal0 = (cond1) ? val1 : (cond2) ? val2 : val3;'
    Got match: 'assign signal1[15:0] = {input1[7:0], input2[7:0]};'
    Got match: 'assign signal2[34:0] = { 4'b0,
                             subsig0[3:0],
                             subsig1,
                             subsig2,
                             subsig3[18:2],
                             subsig4[5:0]
                           };'

Source: https://habr.com/ru/post/1274071/

