Trying to remove specific columns using splice in Perl

I'm a Perl newbie looking for help with my first Perl script.

I have huge files, 30-50 GB in size, laid out like this, with millions of columns and thousands of rows:

ABCDE 1 2 3 4 5 6 7 8 9 10 ABCDE 1 2 3 4 5 6 7 8 9 10 ABCDE 1 2 3 4 5 6 7 8 9 10 ABCDE 1 2 3 4 5 6 7 8 9 10 ABCDE 1 2 3 4 5 6 7 8 9 10 ABCDE 1 2 3 4 5 6 7 8 9 10 ABCDE 1 2 3 4 5 6 7 8 9 10 

I would like to delete column "A" and column "C", then every third number column, so column "3" and column "6", then column "9", and so on to the end of the line. The file is space-delimited.

My attempt is as follows:

 #!/usr/local/bin/perl
 use strict;
 use warnings;

 my @dataColumns;
 my $dataColumnCount;

 if(scalar(@ARGV) != 2){
     print "\nNo files supplied, please supply file name\n";
     exit;
 }

 my $Infile = $ARGV[0];
 my $Outfile = $ARGV[1];

 open(INFO,$Infile) || die "Could not open $Infile for reading";
 open(OUT,">$Outfile") || die "Could not open $Outfile for writing";

 while (<INFO>) {
     chop;
     @dataColumns = split(" ");
     $dataColumnCount = @dataColumns + 1;

     #Now remove the first element of the list
     shift(@dataColumns);

     #Now remove the third element (Note that it is now the second - after removal of the first)
     splice(@dataColumns,1,1); # remove the third element (now the second)

     #Now remove the 6th (originally the 8th) and every third one thereafter
     #NB There are now $dataColumnCount-1 columns
     for (my $i = 5; $i < $dataColumnCount-1; $i = $i + 3 ) {
         splice($dataColumns; $i; 1);
     }

     #Now join the remaining elements of the list back into a single string
     my $AmendedLine = join(" ",@dataColumns);

     #Finally print out the line into your new file
     print OUT "$AmendedLine/n";
 }

But I get some strange errors:

  • It says it doesn't like my $1 in the for loop. I added "my", which seems to make the error go away, but nobody else's for-loop code seems to contain "my", so I'm not sure what is going on.

 Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.
 Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.
 Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.
 Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.

  • Another error:

 syntax error at Convertversion2.pl line 37, near "@dataColumns;"
 syntax error at Convertversion2.pl line 37, near "1"

I think I'm almost there, but I don't know what the syntax error is or how to fix it.

Thanks in advance.

+4
3 answers

After I wrote about this question, a commenter pointed out that the run time for my test case can be reduced by 45%. I paraphrased his code a bit:

 my @keep;
 while (<>) {
     my @data = split;

     unless (@keep) {
         @keep = (0, 1, 0, 1, 1);
         for (my $i = 5; $i < @data; $i += 3) {
             push @keep, 1, 1, 0;
         }
     }

     my $i = 0;
     print join(' ', grep $keep[$i++], @data), "\n";
 }

This runs in almost half the time my original solution took:

 $ time ./zz.pl input.data > /dev/null
 real 0m21.861s
 user 0m21.310s
 sys 0m0.280s 

Now, you can squeeze out another 45% of performance by using Inline::C in a rather dirty way:

 #!/usr/bin/env perl

 use strict;
 use warnings;

 use Inline C => <<'END_C'

 /*
   This code 'works' only in a limited set of circumstances!
   Don't expect anything good if you feed it anything other
   than plain ASCII
 */

 #include <ctype.h>

 SV *
 extract_fields(char *line, AV *wanted_fields)
 {
     int ch;
     IV current_field = 0;
     IV wanted_field = -1;

     unsigned char *cursor = line;
     unsigned char *field_begin = line;
     unsigned char *save_field_begin;

     STRLEN field_len = 0;
     IV i_wanted = 0;
     IV n_wanted = av_len(wanted_fields);

     AV *ret = newAV();
     while (i_wanted <= n_wanted) {
         SV **p_wanted = av_fetch(wanted_fields, i_wanted, 0);
         /* av_fetch returns NULL when the element does not exist,
            so test the pointer itself before dereferencing it */
         if (!p_wanted) {
             croak("av_fetch returned NULL pointer");
         }
         wanted_field = SvIV(*p_wanted);

         while ((ch = *(cursor++))) {

             if (!isspace(ch)) {
                 continue;
             }

             field_len = cursor - field_begin - 1;
             save_field_begin = field_begin;
             field_begin = cursor;

             current_field += 1;

             if (current_field != wanted_field) {
                 continue;
             }

             av_push(ret, newSVpvn(save_field_begin, field_len));
             break;
         }
         i_wanted += 1;
     }
     return newRV_noinc((SV *) ret);
 }

 END_C
 ;

And here is the Perl part. Note that we split only once, to figure out the indices of the fields to keep. Once we know those, we pass each line and the (1-based) indices to the C routine for slicing and dicing.

 my @keep;
 while (my $line = <>) {
     unless (@keep) {
         @keep = (2, 4, 5);
         my @data = split ' ', $line;
         push @keep, grep +(($_ - 5) % 3), 6 .. scalar(@data);
     }
     my $fields = extract_fields($line, \@keep);
     print join(' ', @$fields), "\n";
 }

 $ time ./ww.pl input.data > /dev/null
 real 0m11.539s
 user 0m11.083s
 sys 0m0.300s 
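
To make the index computation concrete, here is a small pure-Perl sketch that builds the same 1-based list for a short row. The 11-field sample row and the helper name keep_indices are made up for illustration, and an ordinary array slice stands in for the C routine, so this only shows the selection logic, not the speed-up:

```perl
use strict;
use warnings;

# Hypothetical helper: build the 1-based list of fields to keep for a row
# with $n fields. Fields 2, 4 and 5 (B, D, E) are always kept; from field 6
# onwards, every third field (8, 11, 14, ...) is dropped.
sub keep_indices {
    my ($n) = @_;
    my @keep = (2, 4, 5);
    push @keep, grep { ($_ - 5) % 3 } 6 .. $n;
    return @keep;
}

# Made-up short row; the real rows have millions of fields.
my @data = split ' ', 'A B C D E 1 2 3 4 5 6';
my @keep = keep_indices(scalar @data);

print "keep: @keep\n";    # keep: 2 4 5 6 7 9 10

# An array slice (0-based, hence the "- 1") stands in for extract_fields.
print join(' ', @data[ map { $_ - 1 } @keep ]), "\n";    # B D E 1 2 4 5
```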

input.data was generated using:

 $ perl -E 'say join(" ", "A" .. "ZZZZ") for 1 .. 100' > input.data

and it has a size of about 225 MB.

+3

The code you show does not produce those errors. You have no $1 in there at all, and if you meant $i, then your use of that variable is fine. The only syntax error is in the line splice($dataColumns; $i; 1), which has semicolons instead of commas and uses $dataColumns instead of @dataColumns.
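
For illustration, a minimal sketch of a well-formed splice call. The short sample array is made up, and this shows only the call syntax, not a complete fix for the script:

```perl
use strict;
use warnings;

# Made-up sample row; the real rows have millions of fields.
my @dataColumns = qw(A B C D E 1 2 3);

# splice takes the array itself (sigil @, not $) and comma-separated
# arguments: splice ARRAY, OFFSET, LENGTH
my @removed = splice(@dataColumns, 1, 1);    # removes the element at index 1

print "removed: @removed\n";        # removed: B
print "left:    @dataColumns\n";    # left:    A C D E 1 2 3
```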

Apart from that,

  • It is good practice to declare variables as close as possible to their point of first use, rather than at the top of the program.

  • Capital letters are generally reserved for global identifiers such as package names. You should use lowercase letters, digits, and underscores for variables.

  • Did you know that you set $dataColumnCount to one more than the number of elements in @dataColumns?

  • These days, using global filehandles is frowned upon; you should use lexical variables instead.

I suggest this refactoring of your program. It uses autodie to avoid having to check the open calls. It builds the list of array indices to delete as soon as it can, that is, once the number of fields per row is known after reading the first record. It then deletes them from the end backwards, so that no index arithmetic is needed as earlier elements are removed.

 #!/usr/local/bin/perl

 use strict;
 use warnings;
 use autodie;

 if (@ARGV != 2) {
     die "\nNo files supplied, please supply file names\n";
 }

 my ($infile, $outfile) = @ARGV;

 open my $info, '<', $infile;
 open my $out, '>', $outfile;

 my @remove;

 while (<$info>) {
     my @data = split;

     unless (@remove) {
         @remove = (0, 2);
         for (my $i = 7; $i < @data; $i += 3) {
             push @remove, $i;
         }
     }

     splice @data, $_, 1 for reverse @remove;
     print $out join(' ', @data), "\n";
 }
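
To see why the indices are processed in reverse, here is a small sketch that removes the same indices from a short row in both orders. The sample row and the helper name remove_in_order are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: remove the given indices from a copy of the row,
# in exactly the order they are supplied.
sub remove_in_order {
    my ($row, @indices) = @_;
    my @copy = @$row;
    splice @copy, $_, 1 for @indices;
    return @copy;
}

my @row    = qw(A B C D E 1 2 3 4 5);    # made-up short row
my @remove = (0, 2, 7);                  # columns "A", "C" and "3"

# Highest index first: each removal leaves the lower indices untouched.
my @good = remove_in_order(\@row, reverse @remove);

# Lowest index first: every removal shifts the later elements left by one,
# so the remaining indices no longer point at the intended columns.
my @bad = remove_in_order(\@row, @remove);

print "reverse order: @good\n";    # reverse order: B D E 1 2 4 5
print "forward order: @bad\n";     # forward order: B C E 1 2 3 4
```

In forward order, only the first removal hits its intended column; "D" and "5" are deleted instead of "C" and "3".
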
+2

While the other answers above work fine, and mine probably offers no benefits, this is another way to achieve the same thing while avoiding splice:

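 #!/usr/local/bin/perl
 use strict;
 use warnings;
 use feature 'say';

 my $dir = 'D:\\';
 open my $fh, "<", "$dir\\test.txt" or die "Could not open file: $!";

 while (<$fh>) {
     chomp;
     my @fields = split ' ';
     my @kept;
     for my $i (0 .. $#fields) {
         # Skip columns "A" (index 0) and "C" (index 2) ...
         next if $i == 0 || $i == 2;
         # ... and every third number column, starting with "3" (index 7).
         next if $i >= 7 && ($i - 7) % 3 == 0;
         push @kept, $fields[$i];
     }
     print "@kept\n";
 }
 close $fh;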

Please let me know if this is useless.

0

Source: https://habr.com/ru/post/1498509/

