Perl reads a large file for use with multi-line regex

Question


I have a 4 GB text file whose lines vary widely in length. This is only a sample file; the production files will be much larger. I need to read the file and apply a multi-line regular expression to it.

What is the best way to read such a large file for use with a multi-line regular expression?

If I read the file line by line, I don't think my multi-line regex will work correctly. When I use the three-argument form of read, the results of the regex change as I change the length given in the read call. I believe the file is too large to read into an array or into memory.

Here is my code:

    package main;
    use strict;
    use warnings;
    our $VERSION = 1.01;

    my $buffer;
    my $INFILE;
    my $OUTFILE;

    open $INFILE, '<', ... or die "Bad Input File: $!";
    open $OUTFILE, '>',... or die "Bad Output File: $!";

    while ( read $INFILE, $buffer, 512 ) {
        if ($buffer =~ /(?m)(^[^\r\n]*\R+){1}^(B|BREAK|C|CLOSE|D|DO(?! NOT)|E|ELSE|F|FOR|G|GOTO|H|HALT|HANG|I|IF|J|JOB|K|KILL|L|LOCK|M|MERGE|N|O|OPEN|Q|QUIT|R|READ|S|SET|TC|TRE|TRO|TS|U|USE|V|VIEW|W|WRITE|X|XECUTE)( |:).*[^\r\n]/) {
            print $OUTFILE $&;
            print $OUTFILE "\n";
        }
    }

    close( $INFILE );
    close( $OUTFILE );
    1;

Here are some sample data:

    ^%Z("EUD")
    S %L=%LO,%N="E1"
    ^%Z("RT")
    This is data that I don't want the regex to find
    ^%Z("EXY")
    X ^%Z("EW2"),^%Z("ELONG"):$L(%L)>245 S %N="E1" Q:$L(%L)>255 X ^%ZOSF("EON") S DX=0,DY=%EY,X=%RM+1 X ^%ZOSF("RM"),XY K %EX,%EY,%E1,%E2,DX,DY,%NQ
    ^%Z("F12")
    S %A=$P(^DIC(9.8,0),"^",3)+1,%C=$P(^(0),"^",4)+1 X "F %=0:0 Q:'$D(^DIC(9.8,%A,0)) S %A=%A+1" S $P(^DIC(9.8,0),"^",3,4)=%A_"^"_%C,^DIC(9.8,%A,0)=%X_"^R",^DIC(9.8,"B",%X,%A)=""
    ^%Z("F2")
    S %=$H>21549+$H-.1,%Y=%\365.25+141,%=%#365.25\1,%D=%+306#(%Y#4=0+365)#153#61#31+1,%M=%-%D\29+1,%DT=%Y_"00"+%M_"00"+%D,%D=%M_"/"_%D_"/"_$E(%Y,2,3)

The lines above are paired syntactically (lines 1 and 2 go together, 3 and 4, and so on). I need to find specific pairs; in the data above, that is every pair except:

    ^%Z("RT")
    This is data that I don't want the regex to find


regex perl large-files

Intrinsic Mar 17 '17 at 20:32


1 answer

zdim · Answer 1 · 2017-03-17T20:41:40+0000

The question appears to be about parsing a DSL, and in general a regular expression is not a suitable tool for that. A quick search did not turn up a simple list of accepted approaches, other than the pages of CPAN modules and posts like this article. Finding a better approach is really the first step.

However, here is an answer to the question as stated in the title and in the description: how to process a very large file where the units to be processed span an unknown number of lines.

Keep building up a buffer and testing it. Once it matches, process it and clear it.

For example, append each line to a variable and test the variable (try to match it, if you are using a regex). Keep going, and as soon as it matches, process it and clear the variable.

    my $unit;
    while (<$fh>) {
        # chomp;  # if suitable
        $unit .= $_;
        if ( test_unit($unit) ) {
            # process ...
            $unit = undef;
        }
    }

Here test_unit is a placeholder for code that decides whether the assembled unit is ready to be processed. If the test is a regex, it can be built before the loop with my $re = qr/.../; (see qr in perlop) and then checked inside the loop with if ($unit =~ $re).
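As a minimal sketch of that loop with a precompiled regex, the example below reads a few of the question's sample lines through an in-memory filehandle. The pattern is a shortened, assumed stand-in for the question's regex (only the S and X commands), and the names $data, @found, and $line are introduced here for illustration only.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample input (taken from the question's data),
# read through an in-memory filehandle so the sketch is self-contained.
my $data = <<'END';
^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2")
END

open my $fh, '<', \$data or die "Can't open in-memory file: $!";

# Assumed pattern: any line, followed by a line that starts with
# S or X and then a space or a colon. Compiled once, before the loop.
my $re = qr/^([^\r\n]*\R(?:S|X)[ :][^\r\n]*)/m;

my @found;
my $unit = '';
while ( my $line = <$fh> ) {
    $unit .= $line;              # keep building the buffer
    if ( $unit =~ $re ) {        # test it after every appended line
        push @found, $1;         # "process": here, just collect the match
        $unit = '';              # clear the buffer and start over
    }
}
close $fh;

print "$_\n---\n" for @found;
```

Note that in this form the buffer keeps growing for as long as nothing matches, so for input with long non-matching stretches it may be worth trimming lines that can no longer take part in a match.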

The question says that the lines to process come in pairs, but a comment clarifies that consecutive lines are not always related to each other. So we cannot simply process the file two lines at a time.
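One way to cope with that, offered here only as a sketch and not as something the answer itself proposes, is a two-line sliding window: remember the previous line and test each new line against the "command" part of the pattern. This keeps memory use bounded to two lines. The shortened pattern (only S and X) and the names $second, $prev, and @pairs are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample input taken from the question's data.
my $data = <<'END';
^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2")
END

open my $fh, '<', \$data or die "Can't open in-memory file: $!";

# Assumed test for the second line of a pair: it starts with
# S or X followed by a space or a colon.
my $second = qr/\A(?:S|X)[ :]/;

my @pairs;
my $prev;                        # the line before the current one
while ( my $line = <$fh> ) {
    chomp $line;                 # removes "\n" only; a "\r" would remain
    if ( defined $prev and $line =~ $second ) {
        push @pairs, "$prev\n$line";   # previous line plus command line
    }
    $prev = $line;               # slide the window forward by one line
}
close $fh;

print "$_\n---\n" for @pairs;
```

Unlike a single regex run over a buffer, this window also pairs a command line with a preceding command line, so whether it fits depends on how the real data is laid out.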
