How to filter a very, very large file

I have a very large unsorted file, 1000 GB in size, of identification (ID) pairs:

  • ID: ABC123 ID: ABC124
  • ID: ABC123 ID: ABC124
  • ID: ABC123 ID: ABA122
  • ID: ABC124 ID: ABC123
  • ID: ABC124 ID: ABC126

I would like to filter the file for

1) duplicates

example
ABC123 ABC124
ABC123 ABC124

2) inverse pairs (discard the second occurrence)

example
ABC123 ABC124
ABC124 ABC123

After filtering, the sample file above will look like

  • ID: ABC123 ID: ABC124
  • ID: ABC123 ID: ABA122
  • ID: ABC124 ID: ABC126

Currently my solution is

my %hash;

while (my $line = <FH>) {
    chomp $line;                                  # remove \n
    my ($id1, $id2) = split / /, $line;
    if (exists $hash{"$id1 $id2"} || exists $hash{"$id2 $id1"}) {
        next;
    }
    else {
        $hash{"$id1 $id2"} = undef;               ## store the pair in a hash
        print "$line\n";
    }
}

which gives me the desired results for smaller lists, but takes up too much memory for large lists, since I store the hash in memory.

I am looking for a solution that will require less memory. Some thoughts I have:

1) save the hash to a file instead of memory

2) several file passes

3) sorting the file and removing duplicates with Unix sort -u -k1,2 (a rough sketch of this idea is shown below)
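
For idea 3 to also catch inverse pairs, the IDs have to be put into a fixed order before sorting. A rough sketch, assuming each line is just two space-separated IDs (the file names are made up) and that the Unix sort command is available:

import subprocess

# Pass 1: write every pair with the smaller ID first, so inverse pairs
# become exact duplicates.
with open('pairs.txt') as fin, open('canonical.txt', 'w') as fout:
    for line in fin:
        a, b = line.split()
        fout.write(f"{min(a, b)} {max(a, b)}\n")

# Pass 2: 'sort -u' does an external (disk-based) sort and drops duplicates,
# so memory use stays bounded even for a 1000 GB file.
subprocess.run(['sort', '-u', '-o', 'deduped.txt', 'canonical.txt'], check=True)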

After posting on the Computer Science Stack Exchange, an external sorting algorithm was suggested there.

+4

This looks like a classic Map-Reduce problem: map every pair into a canonical order (smaller ID first), then reduce to output each distinct pair once.

map(id1, id2):
    if id1 < id2:
        yield (id1, id2)
    else:
        yield (id2, id1)

reduce(id1, list<ids>):
    ids = hashset(ids)  // fairly small per id
    for each id2 in ids:
        yield (id1, id2)

Because the map step always emits the smaller ID first, a pair and its inverse become the same record and end up at the same reducer, together with all of their duplicates. The set of partner IDs per key is fairly small, so the reducer's hash set fits comfortably in memory, and the framework takes care of distributing the data across machines.

Hadoop is a well-known open-source implementation of Map-Reduce.
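
If a cluster is not available, the same logic can be emulated on a single machine once the mapped pairs have been sorted (for instance with an external sort). A minimal Python sketch of the map and reduce steps above, driven by a toy in-memory list holding the sample pairs from the question:

import itertools

def map_pair(id1, id2):
    # canonical order: smaller ID first, so inverse pairs become identical
    return (id1, id2) if id1 < id2 else (id2, id1)

def reduce_group(id1, ids):
    # the set of partner IDs per key is fairly small
    for id2 in sorted(set(ids)):
        yield id1, id2

# toy driver; a real run would sort/shuffle the mapped pairs on disk
pairs = [("ABC123", "ABC124"), ("ABC123", "ABC124"),
         ("ABC123", "ABA122"), ("ABC124", "ABC123"),
         ("ABC124", "ABC126")]
mapped = sorted(map_pair(a, b) for a, b in pairs)
for key, group in itertools.groupby(mapped, key=lambda p: p[0]):
    for id1, id2 in reduce_group(key, (p[1] for p in group)):
        print(id1, id2)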

+2

If a small rate of false positives is acceptable, a Bloom filter is a good fit. Put each pair into canonical order (smaller ID first), hash it into the filter, and print the line only if the filter has not seen that pair before. A false positive means an occasional pair gets dropped even though it was not really a duplicate; no duplicates slip through.

At roughly 25 bytes per line, 1000 GB is on the order of 40 billion pairs; even a compact 64-bit hash of every pair would take a few hundred gigabytes of memory, whereas a Bloom filter gets by with a handful of bits per element.

If an exact result is required, sorting this amount of data is perfectly feasible; see sortbenchmark.org for what dedicated sorting setups achieve. The 2011 result there handled 1,353 GB in 59.2 seconds on 66 quad-core machines with 24 GB of RAM and 16 x 500 GB disks each.
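
A minimal sketch of the Bloom-filter idea in Python, assuming two whitespace-separated IDs per line; the filter size, the number of hash functions and the file names are illustrative, not tuned:

import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # derive num_hashes bit positions from hashes of the item
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter(size_bits=8 * 10**9, num_hashes=7)    # about 1 GB of RAM
with open('pairs.txt') as fin, open('filtered.txt', 'w') as fout:
    for line in fin:
        a, b = line.split()
        key = f"{min(a, b)} {max(a, b)}"      # canonical order
        if key not in seen:                   # rarely a false positive
            seen.add(key)
            fout.write(line)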

+2

Another route is to load the pairs into an SQL database and let the database do the filtering; whether that is still pleasant with a "very large" 1000 GB file is another question...
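
As a small-scale illustration of that idea (SQLite here only as a stand-in for a real database server; the file, database and table names are invented), a primary key on the normalized pair makes the database reject duplicates and inverse pairs on insert:

import sqlite3

# An on-disk database, so the working set does not need to fit in RAM.
con = sqlite3.connect('pairs.db')
con.execute("CREATE TABLE IF NOT EXISTS pairs ("
            "id1 TEXT, id2 TEXT, PRIMARY KEY (id1, id2)) WITHOUT ROWID")
with open('pairs.txt') as fin:
    rows = ((min(a, b), max(a, b))            # normalize: smaller ID first
            for a, b in (line.split() for line in fin))
    con.executemany("INSERT OR IGNORE INTO pairs VALUES (?, ?)", rows)
con.commit()
for id1, id2 in con.execute("SELECT id1, id2 FROM pairs ORDER BY id1, id2"):
    print(id1, id2)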

+2

You do not necessarily need Hadoop or a database for this: a single machine and a few sequential passes over the file are enough. Here is a step-by-step plan.

First, reorder the IDs inside each line so that the smaller one always comes first; that turns inverse pairs into exact duplicates. This is one streaming pass over the file and could be done in Bash, but here it is in Python:

with open('input.txt') as file_in, open('reordered.txt', 'w') as file_out:
    for line in file_in:
        reordered = ' '.join(sorted(line.split()))  # reorder IDs, smaller first
        file_out.write(reordered + '\n')

The reordered file is still far too large to deduplicate in memory, so split it into parts by hashing each whole line. Identical lines always hash to the same value, so every copy of a pair lands in the same part:

N_PARTS = 1000
with open('reordered.txt') as file_in:
    for line in file_in:
        part_id = hash(line) % N_PARTS  # part_id will be between 0 and (N_PARTS-1)
        with open('part-%04d.txt' % part_id, 'a') as part_file:
            part_file.write(line)       # line already ends with '\n'

Python's built-in hash() is not a perfectly uniform hash, but taken modulo N_PARTS it spreads the lines evenly enough for this purpose. With N_PARTS = 1000, each part of a 1000 GB file comes out at roughly 1 GB; if that is still too much to hold in memory, simply increase N_PARTS.

Be aware that this step creates and appends to a large number of files; check whatever limits your system places on files (for example ulimit -f) before running it on the full data set, or the pass may fail part-way through.

Each part is now small enough to be loaded into memory, so deduplicate the parts one at a time with an ordinary set and append the surviving records to the output:

with open('output.txt', 'w') as file_out:
    for i in range(N_PARTS):                          # for each part
        unique = set()                                # reset per part, keeps memory bounded
        with open('part-%04d.txt' % i) as part_file:
            for line in part_file:                    # for each line
                unique.add(line)
        for record in unique:
            file_out.write(record)                    # record already ends with '\n'

All in all, that is about three sequential passes over the data, regardless of N_PARTS.

+1

My suggestion goes in the same direction as @Tom's: use SQL. The example below is Transact-SQL, but the same approach works in any database with SQL windowing/ranking functions such as row_number() (so not older versions of MySQL).

The trick is to normalize the pairs first, putting the smaller value into id1 and the larger into id2, so that a pair and its inverse become identical rows.

The duplicates can then be removed with a ranking function, keeping only the first row of each (id1, id2) group.

The INSERT below just sets up the sample data from the question; for the real file you would bulk-load the rows into the table instead. The full script:

CREATE TABLE #TestTable
(
    id int,
    id1 char(6) NOT NULL,
    id2 char(6) NOT NULL
)

insert into 
#TestTable (id, id1, id2) 
values 
    (1, 'ABC123', 'ABC124'),
    (2, 'ABC123', 'ABC124'),
    (3, 'ABC123', 'ABA122'),
    (4, 'ABC124', 'ABC123'),
    (5, 'ABC124', 'ABC126');

select 
    id, 
    (case when id1 <= id2 
        then id1 
        else id2 
    end) id1,
    (case when id1 <= id2 
        then id2 
        else id1 
    end) id2
    into #correctedTable 
from #TestTable

create index idx_id1_id2 on #correctedTable (id1, id2, id)

;with ranked as
(select 
    ROW_NUMBER() over (partition by id1, id2 order by id) dupeRank, 
    id,
    id1,
    id2
 from #correctedTable)

select id, id1, id2 
  from ranked where dupeRank = 1

drop table #correctedTable
drop table #TestTable

The output:

3           ABA122 ABC123
1           ABC123 ABC124
5           ABC124 ABC126

+1

Here are my 0.02 €.

This is an old, well-understood problem, and the textbook answer is Merge Sort, or more precisely an external merge sort: read the data in chunks that fit in memory, sort each chunk, write the sorted runs back to disk, and then merge the runs, dropping duplicates along the way.
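
A compact sketch of such an external sort in Python; the function name, the run size and the assumption that the pairs are already in canonical order (smaller ID first) are all illustrative:

import heapq
import tempfile
from itertools import islice

def external_sort_unique(in_path, out_path, lines_per_run=1_000_000):
    # Pass 1: cut the input into runs that fit in memory, sort each run, spill to disk.
    runs = []
    with open(in_path) as fin:
        while True:
            chunk = list(islice(fin, lines_per_run))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile('w+t')
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Pass 2: k-way merge of the sorted runs, dropping duplicate lines on the way.
    with open(out_path, 'w') as fout:
        previous = None
        for line in heapq.merge(*runs):
            if line != previous:
                fout.write(line)
                previous = line
    for run in runs:
        run.close()

external_sort_unique('canonical.txt', 'deduped.txt')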

If one machine is too slow, the job also parallelizes naturally: sort separate chunks on separate machines and merge the sorted results at the end.

Have a look at the Linda coordination model and its Tuple Space: workers take pieces of work out of a shared space and put their results back in, which maps nicely onto this kind of split-sort-merge processing.

The Wikipedia article on Parallel Computing is a good starting point for the different models.

I ran into a similar problem around 1990, on i386/i486 machines with 320x200, 256-colour displays and almost no memory by today's standards; an external sort written in C++ chewed through the data in roughly 15 minutes. Working with data that does not fit in memory is an old constraint, not a new one.

Problems like this have been solved successfully since the earliest days of computing, the days of B-Trees, tape drives, seek times, Fortran, Cobol and the IBM AS/400. If you think the way the engineers of that era did, you will probably come up with something clever :)

EDIT: In fact, you are probably looking for an External Sort.


Source: https://habr.com/ru/post/1539984/

