Parsing unsorted data from a large fixed-width text file

I am mainly a Matlab user and a Perl n00b. This is my first Perl script.

I have a large fixed-width data file that I would like to process into a binary file with a table of contents. My problem is that the data files are quite large and the parameters are sorted by time, which makes them difficult (at least for me) to parse in Matlab. Since Matlab is not well suited to text processing, I thought I would try Perl. I wrote the following code, which works ... at least on my small test file. However, it is painfully slow when I try it on a real, large data file. It was pieced together from many examples for various tasks in the Perl web documentation.

Here is a small example of a data file. Note: the real file has about 2000 parameters and is 1-2 GB. Parameters can be text, doubles, or unsigned integers.

 Param 1 filter = ALL_VALUES
 Param 2 filter = ALL_VALUES
 Param 3 filter = ALL_VALUES
 Time       Name                   Ty  Value
 ---------- ---------------------- --- ------------
 1.1        Param 1                UI  5
 2.23       Param 3                TXT Some Text 1
 3.2        Param 1                UI  10
 4.5        Param 2                D   2.1234
 5.3        Param 1                UI  15
 6.121      Param 2                D   3.1234
 7.56       Param 3                TXT Some Text 2

The main logic of my script is as follows:

  • Read lines up to the ---- ruler to build the list of parameters to extract (each parameter line contains "filter =").
  • Use the ---- ruler line to determine the field widths; it is delimited by spaces.
  • For each parameter, build time and data arrays (a while loop over the data lines nested inside a foreach over the parameters).
  • In the continue block, write the time and data to a binary file, then write the name, type, and offsets to a text table-of-contents file (used later to read the binary back into Matlab).
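The second step above - deriving an unpack template from the dashed ruler line - can be sketched on its own. A minimal sketch (the ruler is copied from the sample file; the sprintf row is a made-up stand-in for a real data line):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each run of dashes plus its trailing spaces becomes one fixed-width 'A'
# field; the last field becomes 'A*' so the value may run to end of line.
my $ruler = '---------- ---------------------- --- ------------';
my @template = map { 'A' . length } $ruler =~ /(\S+\s*)/g;
$template[-1] = 'A*';
my $template = "@template";
print "$template\n";               # A11 A23 A4 A*

# Unpacking a data row with that template yields trimmed fields,
# because 'A' strips trailing whitespace:
my $row = sprintf '%-11s%-23s%-4s%s', '1.1', 'Param 1', 'UI', '5';
my @fields = unpack $template, $row;
print "$fields[1]|$fields[3]\n";   # Param 1|5
```

Note that this keeps embedded spaces in "Param 1" intact, which a plain whitespace split would not.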

Here is my script:

 #!/usr/bin/perl
 $lineArg1 = @ARGV[0];
 open(INFILE, $lineArg1);
 open BINOUT, '>:raw', $lineArg1.".bin";
 open TOCOUT, '>', $lineArg1.".toc";
 my $line;
 my $data_start_pos;
 my @param_name;
 my @template;
 while ($line = <INFILE>) {
     chomp $line;
     if ($line =~ s/\s+filter = ALL_VALUES//) {
         $line = =~ s/^\s+//;
         $line =~ s/\s+$//;
         push @param_name, $line;
     }
     elsif ($line =~ /^------/) {
         @template = map {'A'.length} $line =~ /(\S+\s*)/g;
         $template[-1] = 'A*';
         $data_start_pos = tell INFILE;
         last; #Reached start of data exit loop
     }
 }
 my $template = "@template";
 my @lineData;
 my @param_data;
 my @param_time;
 my $data_type;
 foreach $current_param (@param_name) {
     @param_time = ();
     @param_data = ();
     seek(INFILE,$data_start_pos,0); #Jump to data start
     while ($line = <INFILE>) {
         if($line =~ /$current_param/) {
             chomp($line);
             @lineData = unpack $template, $line;
             push @param_time, @lineData[0];
             push @param_data, @lineData[3];
         }
     } # END WHILE <INFILE>
 } #END FOR EACH NAME
 continue {
     $data_type = @lineData[2];
     print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; #Write name,type,offset to start time
     print BINOUT pack('d*', @param_time); #Write TimeStamps
     print TOCOUT tell(BINOUT).","; #offset to end of time/data start
     if ($data_type eq "TXT") {
         print BINOUT pack 'A*', join("\n",@param_data);
     }
     elsif ($data_type eq "D") {
         print BINOUT pack('d*', @param_data);
     }
     elsif ($data_type eq "UI") {
         print BINOUT pack('L*', @param_data);
     }
     print TOCOUT tell(BINOUT).","."\n"; #Write memory loc to end data
 }
 close(INFILE);
 close(BINOUT);
 close(TOCOUT);

So, my questions to you, good people on the Internet, are as follows:

  • What am I obviously messing up? Syntax, declaring variables when I don't need to, etc.
  • It is probably slow (I'm guessing) due to the nested loops and re-searching the lines over and over. Is there a better way to restructure the loops so that multiple parameters are extracted in one pass?
  • Any other speed-up tips you can give?

Edit: I modified the sample text file to illustrate that timestamps can be non-integer and Param names may contain spaces.

+4
4 answers

I modified my code to build a hash as suggested. I did not include the binary output yet, due to time limitations. I also still need to figure out how to reference the hash to get at the data and pack it into the binary files. I don't think that part should be complicated ... I hope.
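For what it's worth, dereferencing the hash to feed pack is a one-liner per field. A sketch using hand-built sample values in the same shape as the %dataHash the script below produces:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-built slice with the same shape as %dataHash (hypothetical values):
my %dataHash = (
    'Param 1' => { time => [1.1, 3.2, 5.3], type => 'UI', data => [5, 10, 15] },
);

# @{ ... } dereferences the stored array refs so they can be fed to pack:
my $times = pack 'd*', @{ $dataHash{'Param 1'}{time} };   # 3 doubles
my $data  = pack 'L*', @{ $dataHash{'Param 1'}{data} };   # 3 unsigned 32-bit ints
printf "%d %d\n", length $times, length $data;            # 24 12
```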

On the actual data file (~350 MB and 2.0 million lines), the following code takes about 3 minutes to build the hash. CPU usage was 100% on one of my cores (nil on the other three), and Perl's memory usage peaked at about 325 MB ... until it started dumping millions of lines to the terminal. The Dumper print will be replaced by the binary pack, though.

Please let me know if I am making any rookie mistakes.

 #!/usr/bin/perl
 use strict;
 use warnings;
 use Data::Dumper;

 my $lineArg1 = $ARGV[0];
 open(INFILE, $lineArg1);
 my $line;
 my @param_names;
 my @template;
 while ($line = <INFILE>) {
     chomp $line; #Remove New Line
     if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
         push @param_names, trim($line);
     }
     elsif ($line =~ /^----/) {
         @template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
         $template[-1] = 'A*';
         my $data_start_pos = tell INFILE;
         last; #Reached start of data exit loop
     }
 }
 my $size = $#param_names+1;
 my @getType = ((1) x $size);
 my $template = "@template";
 my @lineData;
 my %dataHash;
 my $lineCount = 0;
 while ($line = <INFILE>) {
     if ($lineCount % 100000 == 0){
         print "On Line: ".$lineCount."\n";
     }
     if ($line =~ /^\d/) {
         chomp($line);
         @lineData = unpack $template, $line;
         my ($inHeader, $headerIndex) = findStr($lineData[1], @param_names);
         if ($inHeader) {
             push @{$dataHash{$lineData[1]}{time} }, $lineData[0];
             push @{$dataHash{$lineData[1]}{data} }, $lineData[3];
             if ($getType[$headerIndex]){ # Things that only need written once
                 $dataHash{$lineData[1]}{type} = $lineData[2];
                 $getType[$headerIndex] = 0;
             }
         }
     }
     $lineCount ++;
 } # END WHILE <INFILE>
 close(INFILE);
 print Dumper \%dataHash;

 #WRITE BINARY FILE and TOC FILE
 my %convert = (
     TXT=>sub{pack 'A*', join "\n", @_},
     D=>sub{pack 'd*', @_},
     UI=>sub{pack 'L*', @_},
 );
 open my $binfile, '>:raw', $lineArg1.'.bin';
 open my $tocfile, '>', $lineArg1.'.toc';
 for my $param (@param_names){
     my $data = $dataHash{$param};
     my @toc_line = ($param, $data->{type}, tell $binfile );
     print {$binfile} $convert{D}->(@{$data->{time}});
     push @toc_line, tell $binfile;
     print {$binfile} $convert{$data->{type}}->(@{$data->{data}});
     push @toc_line, tell $binfile;
     print {$tocfile} join(',',@toc_line,''),"\n";
 }

 sub trim { #Trim leading and trailing white space
     my (@strings) = @_;
     foreach my $string (@strings) {
         $string =~ s/^\s+//;
         $string =~ s/\s+$//;
         chomp ($string);
     }
     return wantarray ? @strings : $strings[0];
 } # END SUB

 sub findStr { #Return TRUE if string is contained in array.
     my $searchStr = shift;
     my $i = 0;
     foreach ( @_ ) {
         if ($_ eq $searchStr){
             return (1,$i);
         }
         $i ++;
     }
     return (0,-1);
 } # END SUB

The output is as follows:

 $VAR1 = {
           'Param 1' => {
                          'time' => [ '1.1', '3.2', '5.3' ],
                          'type' => 'UI',
                          'data' => [ '5', '10', '15' ]
                        },
           'Param 2' => {
                          'time' => [ '4.5', '6.121' ],
                          'type' => 'D',
                          'data' => [ '2.1234', '3.1234' ]
                        },
           'Param 3' => {
                          'time' => [ '2.23', '7.56' ],
                          'type' => 'TXT',
                          'data' => [ 'Some Text 1', 'Some Text 2' ]
                        }
         };

Here is the TOC output file:

 Param 1,UI,0,24,36,
 Param 2,D,36,52,68,
 Param 3,TXT,68,84,107,
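Those TOC numbers are byte offsets into the .bin file, so any single parameter can be read back with a seek and an unpack. A sketch that rebuilds the sample .bin contents in memory (assuming native 8-byte doubles and 32-bit 'L' integers, as pack 'd*'/'L*' produce on typical platforms) and then uses the "Param 2" TOC entry:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rebuild the sample .bin in memory, in the order the script writes it:
# per parameter, timestamps first, then data.
my $bin = '';
$bin .= pack 'd*', 1.1, 3.2, 5.3;                # Param 1 times (offset 0)
$bin .= pack 'L*', 5, 10, 15;                    # Param 1 data  (offset 24)
$bin .= pack 'd*', 4.5, 6.121;                   # Param 2 times (offset 36)
$bin .= pack 'd*', 2.1234, 3.1234;               # Param 2 data  (offset 52)
$bin .= pack 'd*', 2.23, 7.56;                   # Param 3 times (offset 68)
$bin .= pack 'A*', "Some Text 1\nSome Text 2";   # Param 3 data  (offset 84..107)

# The TOC line "Param 2,D,36,52,68," says the timestamps live in bytes 36..52
# (on a real file this would be seek + read instead of substr):
my @times = unpack 'd*', substr $bin, 36, 52 - 36;
print "@times\n";   # 4.5 6.121
```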

Thank you all for your help! This is a great resource!

EDIT: Added the code to write the Binary and TOC files.

0

First, you should always have the 'use strict;' and 'use warnings;' pragmas in your scripts.

It seems you need only a simple array ( @param_name ) for the lookup, so loading those values will be straightforward, as you have it. (Again, adding the above pragmas will start showing you errors, including on the line $line = =~ s/^\s+//; !)

I suggest you read this to understand how you can load a data file into hashes . Once you have designed the hash, you simply read and load the data contents of the file into it, and then iterate over the contents of the hash.

For example, using the time as the key of the hash:

 %HoH = (
     1 => {
         name  => "Param1",
         ty    => "UI",
         value => "5",
     },
     2 => {
         name  => "Param3",
         ty    => "TXT",
         value => "Some Text 1",
     },
     3 => {
         name  => "Param1",
         ty    => "UI",
         value => "10",
     },
 );

Before you begin processing, make sure you close INFILE after reading the contents.

So, at the end, you iterate over the hash and reference the arrays (instead of the file contents) for your output entries - I would expect that to be much faster.
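A minimal sketch of that flow, using the time-keyed layout from the example above (field splitting here is simplified and assumes names without spaces, as in that example): load the file into the hash once, then iterate the hash instead of re-reading the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One pass over the data lines builds the hash-of-hashes keyed by time:
my %HoH;
while (my $line = <DATA>) {
    chomp $line;
    my ($time, $name, $ty, $value) = split ' ', $line, 4;
    $HoH{$time} = { name => $name, ty => $ty, value => $value };
}

# The output pass then iterates the hash, not the file:
for my $time (sort { $a <=> $b } keys %HoH) {
    print "$time: $HoH{$time}{name} ($HoH{$time}{ty}) = $HoH{$time}{value}\n";
}

__DATA__
1 Param1 UI 5
2 Param3 TXT Some Text 1
3 Param1 UI 10
```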

Let me know if you need more information.

Note: if you go this route, Data::Dumper is a significant help for printing and understanding the data in your hash!

+3

It seems to me that embedded spaces can occur only in the last field. This makes it possible to use split ' ' for this problem.
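A quick illustration of why the field limit matters here: with a limit of 4, the fourth field soaks up the rest of the line, embedded spaces and all (the sample line is made up to match the data format).

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = '2          Param3                 TXT Some Text 1';

# Limit of 4: split stops after three cuts, so the value keeps its spaces.
my ($time, $param, $ty, $value) = split ' ', $line, 4;
print "$param|$value\n";    # Param3|Some Text 1

# Without the limit, the value would be broken apart:
my @broken = split ' ', $line;
print scalar @broken, "\n"; # 6
```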

I assume you are not interested in the header. I also assume that you want a vector for each parameter and are not interested in the timestamps.

To use data file names specified on the command line or passed through standard input, replace <DATA> with <> .

 #!/usr/bin/env perl

 use strict;
 use warnings;

 my %data;

 $_ = <DATA> until /^-+/; # skip header

 while (my $line = <DATA>) {
     $line =~ s/\s+\z//;
     last unless $line =~ /\S/;
     my (undef, $param, undef, $value) = split ' ', $line, 4;
     push @{ $data{ $param } }, $value;
 }

 use Data::Dumper;
 print Dumper \%data;

 __DATA__
 Param1 filter = ALL_VALUES
 Param2 filter = ALL_VALUES
 Param3 filter = ALL_VALUES

 Time       Name                   Ty  Value
 ---------- ---------------------- --- ------------
 1          Param1                 UI  5
 2          Param3                 TXT Some Text 1
 3          Param1                 UI  10
 4          Param2                 D   2.1234
 5          Param1                 UI  15
 6          Param2                 D   3.1234
 7          Param3                 TXT Some Text 2

Output:

  $VAR1 = {
           'Param2' => [
                         '2.1234',
                         '3.1234'
                       ],
           'Param1' => [
                         '5',
                         '10',
                         '15'
                       ],
           'Param3' => [
                         'Some Text 1',
                         'Some Text 2'
                       ]
         };
+1

First, this part of the code forces the input file to be read once for each parameter. That is quite inefficient.

 foreach $current_param (@param_name) {
     ...
     seek(INFILE,$data_start_pos,0); #Jump to data start
     while ($line = <INFILE>) {
         ...
     }
     ...
 }

There is also rarely a reason to use a continue block. That is more a style/readability issue than a real problem, though.
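For reference, a continue block runs after every iteration of the attached loop, even when the body calls next, so almost anything it does can simply live at the end of the loop body instead. A toy demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @out;
for my $i (1 .. 3) {
    next if $i == 2;          # skips the body, but NOT the continue block
    push @out, "body $i";
} continue {
    push @out, "cont $i";     # runs after every iteration
}
print join(', ', @out), "\n";
# body 1, cont 1, cont 2, body 3, cont 3
```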


Now let's make it more efficient.

I packed the sections separately so that I could process each line exactly once. To avoid using a large amount of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections to the binary file.
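In miniature, the spool-then-append pattern looks like this (the values are made up; tempfile hands back an already-open handle that is deleted automatically on exit):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp 'tempfile';
use File::Copy 'copy';

# Spool one parameter's packed values to a temp file as lines are processed...
my $spool = tempfile 'spool_XXXX', UNLINK => 1;
binmode $spool, ':raw';
print {$spool} pack 'd*', 1.1, 3.2, 5.3;

# ...then rewind it and append the whole section to the final binary.
my $bin = tempfile 'bin_XXXX', UNLINK => 1;   # stand-in for the real .bin
binmode $bin, ':raw';
seek $spool, 0, 0;                            # seek also flushes the spool
copy $spool, $bin, 8 * 1024;

# Read back to show the round trip worked:
seek $bin, 0, 0;
read $bin, my $buf, 24;
print join(' ', unpack 'd*', $buf), "\n";     # 1.1 3.2 5.3
```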

This is a quick implementation. If I were going to add much more to it, I would factor it out more than it is now.

 #!/usr/bin/perl
 use strict;
 use warnings;
 use File::Temp 'tempfile';
 use File::Copy 'copy';
 use autodie qw':default copy';
 use 5.10.1;

 my $input_filename = shift @ARGV;

 open my $input, '<', $input_filename;

 my @param_names;
 my $template = ''; # stop uninitialized warning
 my @field_names;
 my $field_name_line;
 while( <$input> ){
     chomp;
     next if /^\s*$/;
     if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
         push @param_names, $param;
     }elsif( /^[\s-]+$/ ){
         my @fields = split /(\s+)/;
         my $pos = 0;
         for my $field (@fields){
             my $length = length $field;
             if( substr($field, 0, 1) eq '-' ){
                 $template .= "\@${pos}A$length ";
             }
             $pos += $length;
         }
         last;
     }else{
         $field_name_line = $_;
     }
 }

 @field_names = unpack $template, $field_name_line;
 for( @field_names ){
     s(^\s+){};
     $_ = lc $_;
     $_ = 'type' if substr('type', 0, length $_) eq $_;
 }

 my %temp_files;
 for my $param ( @param_names ){
     for(qw'time data'){
         my $fh = tempfile 'temp_XXXX', UNLINK => 1;
         binmode $fh, ':raw';
         $temp_files{$param}{$_} = $fh;
     }
 }

 my %convert = (
     TXT => sub{ pack 'A*', join "\n", @_ },
     D   => sub{ pack 'd*', @_ },
     UI  => sub{ pack 'L*', @_ },
 );

 sub print_time{
     my($param,$time) = @_;
     my $fh = $temp_files{$param}{time};
     print {$fh} $convert{D}->($time);
 }

 sub print_data{
     my($param,$format,$data) = @_;
     my $fh = $temp_files{$param}{data};
     print {$fh} $convert{$format}->($data);
 }

 my %data_type;
 while( my $line = <$input> ){
     next if $line =~ /^\s*$/;
     my %fields;
     @fields{@field_names} = unpack $template, $line;

     print_time( @fields{(qw'name time')} );
     print_data( @fields{(qw'name type value')} );
     $data_type{$fields{name}} //= $fields{type};
 }
 close $input;

 open my $bin, '>:raw', $input_filename.".bin";
 open my $toc, '>', $input_filename.".toc";
 for my $param( @param_names ){
     my $data_fh = $temp_files{$param}{data};
     my $time_fh = $temp_files{$param}{time};
     seek $data_fh, 0, 0;
     seek $time_fh, 0, 0;

     my @toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );

     copy( $time_fh, $bin, 8*1024 );
     close $time_fh;
     push @toc_line, sysseek($bin, 0, 1);

     copy( $data_fh, $bin, 8*1024 );
     close $data_fh;
     push @toc_line, sysseek($bin, 0, 1);

     say {$toc} join ',', @toc_line, '';
 }
 close $bin;
 close $toc;
+1
