Removing duplicate pairs from a very large text file

I have a very large text file (several GB) that has the following format:

 1 2
 3 4
 3 5
 3 6
 3 7
 3 8
 3 9

The file is already sorted and duplicate lines have been removed. However, there are duplicate pairs in reverse order, such as "2 1" and "4 3", which I want to delete. Does anyone have a solution for doing this in a resource-limited environment, in BASH, AWK, Perl, or any similar language? I cannot load the whole file into memory and loop over the values.

+4
7 answers

Possible Solution:

  • Scan the file
  • For any pair where the second value is less than the first, swap the two numbers
  • Sort the pairs again, by the first number and then by the second
  • Remove duplicates

I'm still thinking about a more efficient solution in terms of disk scans, but this is the basic naive approach.
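A minimal sketch of this approach in Perl, assuming each line holds two whitespace-separated integers (the script and file names below are placeholders); the external sort does the ordering and deduplication, so the whole file never has to fit in memory:

 #!/usr/bin/perl
 # normalize.pl (placeholder name) -- pass 1 of the swap-then-sort idea:
 # put the smaller number first on every line, then let an external
 # sort order the pairs and drop the duplicates.
 use strict;
 use warnings;
 while ( my $line = <STDIN> ) {
     my ( $x, $y ) = split ' ', $line;
     ( $x, $y ) = ( $y, $x ) if $y < $x;    # swap so the smaller value comes first
     print "$x $y\n";
 }
 # Usage (placeholder file names):
 #   perl normalize.pl < hugetext | sort -k1,1n -k2,2n -u > result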

+3

Do you want to delete lines where the second number is less than the first?

 perl -i~ -lane'print if $F[0] < $F[1]' file 
+4

For each pair, perform a binary search in the file on your hard drive without loading it into memory. Delete the duplicate if you find it. Then do a final pass that removes every run of two or more \n characters.
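A rough sketch of this idea in Perl, with the assumptions spelled out: the file is taken to be sorted numerically by the first and then the second field, a line is dropped only when its reversed pair appears earlier in the file, and surviving lines are written to a new file rather than deleted in place (all names below are placeholders):

 #!/usr/bin/perl
 # binsearch_dedup.pl (placeholder name) -- for every pair whose first field
 # is larger, binary-search the sorted file on disk for the reversed pair;
 # if the reversed pair exists, the current line is a duplicate and is dropped.
 use strict;
 use warnings;
 my $file = shift or die "usage: $0 sorted_pairs_file > result\n";
 open my $scan,   '<', $file or die "$file: $!";   # sequential pass
 open my $search, '<', $file or die "$file: $!";   # handle used only for seeking
 my $size = -s $file;
 while ( my $line = <$scan> ) {
     my ( $x, $y ) = split ' ', $line;
     next if $x > $y && pair_exists( $search, $size, $y, $x );   # reverse duplicate
     print $line;
 }
 # Binary search over byte offsets: true if a line "$x $y" exists in the file.
 sub pair_exists {
     my ( $fh, $size, $x, $y ) = @_;
     my ( $lo, $hi ) = ( 0, $size );
     while ( $lo < $hi ) {
         my $mid = int( ( $lo + $hi ) / 2 );
         if ( $mid == 0 ) {
             seek $fh, 0, 0;                 # probe starts at the first line
         }
         else {
             seek $fh, $mid - 1, 0;
             <$fh>;                          # discard the partial line we landed in
         }
         my $probe = <$fh>;
         if ( !defined $probe ) {            # ran off the end of the file
             $hi = $mid;
             next;
         }
         my ( $u, $v ) = split ' ', $probe;
         my $cmp = $u <=> $x || $v <=> $y;
         return 1 if $cmp == 0;
         if   ( $cmp < 0 ) { $lo = tell $fh; }   # probe line is before the target
         else              { $hi = $mid; }       # probe line is after the target
     }
     return 0;
 }

This keeps memory usage constant, at the cost of one binary search (a handful of disk seeks) per input line.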

+2

Not quite sure whether this works, or how good it is ...

 awk '{ if ($2 > $1) print; else print $2, $1 }' hugetext | sort -k1,1n -k2,2n -u -o hugetext
+2

Do you want to remove duplicates, treating 1 2 and 2 1 as the same pair?

 < file.in perl -lane'print "@F[ $F[0] < $F[1] ? (0,1,0,1) : (1,0,0,1) ]"' \
 | sort -n \
 | perl -lane'$t="@F[0,1]"; print "@F[2,3]" if $t ne $p; $p=$t;' \
 > file.out

This can handle arbitrarily large files.

+1
 perl -lane '
     $key = $F[0] < $F[1] ? "$F[0] $F[1]" : "$F[1] $F[0]";
     $h{$key} = "";
     END {
         print for sort { my ($ax, $ay) = split " ", $a; my ($bx, $by) = split " ", $b; $ax <=> $bx || $ay <=> $by } keys %h;
     }
 ' file.txt

Explanation:

  • I put the two values of the current line in numerical order (the smaller one first)
  • I build the hash key $key by joining the first and second values with a space
  • I set $h{$key} to an empty string
  • At the end, I print all the keys, sorted in numerical order

Hash keys are unique by nature, so each pair is kept only once.

You just need to use Unix redirects to create a new file.
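For example, if the Perl code between the quotes above is saved in a file called dedup.pl (a placeholder name), the result can be redirected like this:

 perl -lan dedup.pl file.txt > file.out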

0

Here's a general O(n) algorithm that does this in a single pass (no repeated scans or sorting required):

  • Start with an empty hashset as your blacklist (a set is a map with just keys)
  • Read the file one line at a time.
  • For each line:
    • Check whether this pair is already in your blacklist.
    • If so, ignore it.
    • If not, add it to your result file, and add the swapped pair to the blacklist (for example, if you just read "3 4", add "4 3" to the blacklist).

This requires O(n) time and O(n) space for the blacklist. (There is no additional storage for the result if you open the file read/write and delete lines in place as you check them against the blacklist.)
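A minimal sketch of this one-pass approach in Perl (all names are placeholders); unlike the in-place variant mentioned in parentheses above, it simply writes the surviving lines to standard output, and the blacklist hash still takes memory proportional to the number of unique pairs:

 #!/usr/bin/perl
 # onepass.pl (placeholder name) -- single pass over the file with a hash
 # used as the blacklist set of reversed pairs.
 use strict;
 use warnings;
 my %blacklist;                            # reversed pairs to skip later
 while ( my $line = <STDIN> ) {
     my ( $x, $y ) = split ' ', $line;
     next if exists $blacklist{"$x $y"};   # its reverse was already written
     print "$x $y\n";                      # keep this pair in the result
     $blacklist{"$y $x"} = 1;              # blacklist the reversed pair
 }
 # Usage (placeholder names):  perl onepass.pl < hugetext > result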

0

Source: https://habr.com/ru/post/1440655/

