How to remove non-unique lines from a large file using Perl?

Duplicate data removal using Perl, invoked through a batch file on Windows: a DOS window is opened via a batch file, and the batch file calls the Perl script that does the work. I have the batch file and the script, and duplicate removal works fine as long as the data file is not too big. The problem I need to resolve concerns larger data files (2 GB or more): at that size a memory error occurs when the script tries to load the complete file into an array for duplicate removal. The memory error occurs in a subroutine, at:

@contents_of_the_file = <INFILE>;

(A completely different approach is acceptable if it solves this problem; please suggest one.) The subroutine:

sub remove_duplicate_data_and_file
{
 open(INFILE,"<" . $output_working_directory . $output_working_filename) or dienice ("Can't open $output_working_filename : INFILE :$!");
  if ($test ne "YES")
   {
    flock(INFILE,1);
   }
  @contents_of_the_file = <INFILE>;
  if ($test ne "YES")
   {
    flock(INFILE,8);
   }
 close (INFILE);
### TEST print "$#contents_of_the_file\n\n";
 @unique_contents_of_the_file= grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);

 open(OUTFILE,">" . $output_restore_split_filename) or dienice ("Can't open $output_restore_split_filename : OUTFILE :$!");
 if ($test ne "YES")
  {
   flock(OUTFILE,1);
  }
for($element_number=0;$element_number<=$#unique_contents_of_the_file;$element_number++)
  {
   print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
  }
 if ($test ne "YES")
  {
   flock(OUTFILE,8);
  }
}
+3
6 answers

You are unnecessarily storing a full copy of the original file in @contents_of_the_file, and you are also storing the de-duplicated data twice, in %unique_contents_of_the_file and in @unique_contents_of_the_file. As ire_and_curses points out, you can take advantage of two things: (1) you don't need to keep the lines themselves around, only a record of which ones you have already seen; and (2) a hash of each line is enough to detect dups.

Here is an example along those lines. It hashes each line with MD5 (Digest::MD5); other hashing strategies would work just as well. Note also the 3-argument form of open(), which is worth switching to.

use strict;
use warnings;

use Digest::MD5 qw(md5);

my (%seen, %keep_line_nums);
my $in_file  = 'data.dat';
my $out_file = 'data_no_dups.dat';

open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;

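# First pass: hash each line and remember (by line number) the first time each distinct line appears.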
while ( defined(my $line = <$in_handle>) ){
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}

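# Second pass: rewind, re-read the file, and write out only the remembered lines.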
seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
    print $out_handle $line if $keep_line_nums{$.};
}    

close $in_handle;
close $out_handle;
+6

You should be able to do this efficiently using hashing. You don't need to keep the data from the lines, only to identify which ones are the same. So:

  • Don't slurp - read the file one line at a time.
  • Hash each line.
  • Store the hashed line representation as a key in a Perl hash of lists, with the line number of its first occurrence as the first value of the list.
  • If the key already exists, append this duplicate's line number to the list held under that key.

At the end of this process you have a data structure that identifies all the duplicate lines; you can then go through the file a second time and drop them (a minimal sketch follows).
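
One possible reading of that outline in code, assuming Digest::MD5 for the per-line hash and a hard-coded input file name (neither is specified in the answer):

use strict;
use warnings;
use Digest::MD5 qw(md5);

my $in_file = 'data.dat';   # assumed input file name
my %lines_by_hash;          # digest => [ line no. of first occurrence, duplicate line nos. ... ]

open my $in, '<', $in_file or die $!;
while ( defined(my $line = <$in>) ) {
    push @{ $lines_by_hash{ md5($line) } }, $.;
}

# Keep only the first occurrence of every distinct line.
my %keep = map { $_->[0] => 1 } values %lines_by_hash;

seek $in, 0, 0;
$. = 0;
while ( defined(my $line = <$in>) ) {
    print $line if $keep{$.};   # unique lines go to STDOUT; duplicates are skipped
}
close $in;

Only digests and line numbers are ever held in memory, never the lines themselves, which is the point of the approach.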

+4

Perl does heroic things with large files, but 2 GB may be a limit of DOS/Windows.

How much RAM do you have?

If your OS doesn't complain, it may be best to read the file one line at a time and write immediately to the output.

I'm thinking of something using the diamond operator <>, but I'm reluctant to suggest any code, because on the occasions I've posted code I've offended a Perl guru on SO.

I don't want to risk it. I hope the Perl cavalry will arrive soon.
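
For illustration only (the answer deliberately posts no code): a line-at-a-time filter built around the diamond operator might look like the sketch below. Keying %seen on an MD5 digest instead of the raw line is an extra assumption made here to keep the hash smaller; it is not part of the answer.

use strict;
use warnings;
use Digest::MD5 qw(md5);

# Read line by line from the file(s) named on the command line (or from STDIN)
# and write each line straight to STDOUT the first time its digest is seen.
my %seen;
while ( defined(my $line = <>) ) {
    print $line unless $seen{ md5($line) }++;
}

It could be run as, say, perl dedup.pl infile > outfile (the script name is made up here).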

+2

A hash keyed on the lines (or on their digests) only helps while it fits in RAM; for a file this size it may not, and then the record of already-seen lines has to be kept on disk rather than in RAM. In that case you can also bound explicitly how much memory is used for caching.

For example, using SQLite:

#!/usr/bin/perl

use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;

my $input= shift;
my $temp= 'unique.tmp';
my $cache_size_in_mb= 100;
unlink $temp if -f $temp;
my $cx= DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find= $cx->prepare("select line from x where id = ?");
my $list= $cx->prepare("select line from x order by line");
my $insert= $cx->prepare("insert into x (id, line) values(?, ?)");
open(FILE, $input) or die $!;
my ($line_number, $next_line_number, $line, $sha)= 1;
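# First pass: store the line number of the first occurrence of each distinct line, keyed by its SHA-1 digest.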
while($line= <FILE>) {
  $line=~ s/\s+$//s;
  $sha= sha1_base64($line);
  unless($cx->selectrow_array($find, undef, $sha)) {
    $insert->execute($sha, $line_number)}
  $line_number++;
}
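# Second pass: rewind the input and print, in their original order, only the recorded line numbers.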
seek FILE, 0, 0;
$list->execute;
$line_number= 1;
$next_line_number= $list->fetchrow_array;
while($line= <FILE>) {
  $line=~ s/\s+$//s;
  if($next_line_number == $line_number) {
    say $line;
    $next_line_number= $list->fetchrow_array;
    last unless $next_line_number;
  }
  $line_number++;
}
close FILE;
+1

In the 'completely different approach' category, if you have Unix commands available (e.g. via Cygwin):

cat infile | sort | uniq > outfile

This needs no Perl at all, which may or may not suit you. Note, however, that the lines in the output will not be in the same order as in infile (since they will be sorted).

EDIT: An approach that copes better with a file of this size could be:

  • read INFILE one line at a time
  • hash each line to a small number (e.g., hash # mod 10)
  • append each line to a file named after that number (e.g., tmp-1 to tmp-10)
  • close INFILE
  • open and sort each tmp-# into a new file sortedtmp-#
  • mergesort sortedtmp-[1-10] (i.e. open all 10 files and read from them simultaneously), skipping duplicates and writing the merged, de-duplicated result to the output file

This way the whole file never has to be held in memory at once.

Steps 2 and 3 could just as well assign lines to buckets by a random number instead of the hash value mod 10; a rough sketch of the whole bucket approach follows.
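
The list above is only an outline; one way it might look in Perl is sketched below. The bucket count, the file names, and the reliance on an external Unix-style sort command (e.g. from Cygwin) are all assumptions made here for illustration, not part of the answer.

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $in_file  = 'infile';    # assumed file names
my $out_file = 'outfile';
my $buckets  = 10;

# Steps 1-4: distribute the lines over $buckets temporary files by hash value,
# so the whole file is never in memory at once.
my @tmp;
for my $n (0 .. $buckets - 1) {
    open $tmp[$n], '>', "tmp-$n" or die $!;
}
open my $in, '<', $in_file or die $!;
while ( defined(my $line = <$in>) ) {
    my $n = hex(substr(md5_hex($line), 0, 8)) % $buckets;
    print { $tmp[$n] } $line;
}
close $in;
close $_ for @tmp;

# Step 5: sort each bucket; LC_ALL=C makes sort's ordering match Perl's
# byte-wise string comparison used in the merge below.
$ENV{LC_ALL} = 'C';
for my $n (0 .. $buckets - 1) {
    system("sort tmp-$n > sortedtmp-$n") == 0 or die "sort of tmp-$n failed";
}

# Step 6: merge the sorted buckets, writing each distinct line only once.
my (@fh, @head);
for my $n (0 .. $buckets - 1) {
    open $fh[$n], '<', "sortedtmp-$n" or die $!;
    $head[$n] = readline $fh[$n];
}
open my $out, '>', $out_file or die $!;
my $last;
while (1) {
    my $min;    # index of the smallest line currently at the head of any bucket
    for my $n (0 .. $buckets - 1) {
        next unless defined $head[$n];
        $min = $n if !defined $min or $head[$n] lt $head[$min];
    }
    last unless defined $min;    # all buckets exhausted
    print $out $head[$min] unless defined $last and $head[$min] eq $last;
    $last = $head[$min];
    $head[$min] = readline $fh[$min];
}
close $out;
close $_ for @fh;
unlink map { ("tmp-$_", "sortedtmp-$_") } 0 .. $buckets - 1;

Because each bucket is sorted with LC_ALL=C, sort's ordering agrees with Perl's byte-wise comparison, so equal lines are guaranteed to meet during the merge even if the buckets were filled at random.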

Here's a BigSort script that might help (although I haven't tested it):

# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ   sort -t";" -k 1,1 $1 > $2

BigSort()
{
if [ -s $1 ]; then
  rm $1.split.* > /dev/null 2>&1
  split -l 2500 -a 5 $1 $1.split.
  rm $1.sort > /dev/null 2>&1
  touch $1.sort1
  for FILE in `ls $1.split.*`
  do
    echo "sort $FILE"
    sort -t";" -k 1,1 $FILE > $FILE.sort
    sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
    mv $1.sort2 $1.sort1
  done
  mv $1.sort1 $2
  rm $1.split.* > /dev/null 2>&1
else
  # work for empty file !
  cp $1 $2
fi
} 
0

Well, you could use the inline-replace mode of command-line perl:

perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
0

Source: https://habr.com/ru/post/1718454/

