Removing duplicate pairs from a very large text file

I have a very large text file (several GB) that has the following format:

 1 2
 3 4
 3 5
 3 6
 3 7
 3 8
 3 9

The file is already sorted and duplicate lines have been removed. However, there are duplicate pairs in reverse order, such as "2 1" and "4 3", which I want to delete. Does anyone have a solution for doing this in a resource-limited environment, in BASH, AWK, Perl, or any similar language? I cannot load the whole file into memory and loop over the values.

+4
7 answers

Possible Solution:

  • Scan the file
  • For any pair where the second value is less than the first, swap the two numbers
  • Sort the pairs again, by the first number and then by the second
  • Remove duplicates

I'm still thinking about a more efficient solution in terms of disk scans, but this is the basic naive approach.
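A minimal sketch of this approach in Perl, assuming each line holds two whitespace-separated integers (the script and file names below are placeholders); the external sort does the ordering and deduplication, so the whole file never has to fit in memory:

 #!/usr/bin/perl
 # normalize.pl (placeholder name) -- pass 1 of the swap-then-sort idea:
 # put the smaller number first on every line, then let an external
 # sort order the pairs and drop the duplicates.
 use strict;
 use warnings;
 while ( my $line = <STDIN> ) {
     my ( $x, $y ) = split ' ', $line;
     ( $x, $y ) = ( $y, $x ) if $y < $x;    # swap so the smaller value comes first
     print "$x $y\n";
 }
 # Usage (placeholder file names):
 #   perl normalize.pl < hugetext | sort -k1,1n -k2,2n -u > result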

+3

Do you want to delete lines where the second number is less than the first?

 perl -i~ -lane'print if $F[0] < $F[1]' file 
+4

For each pair, perform a binary search in the file on your hard drive without loading it into memory. Delete the duplicate if you find it. Then do a final pass that removes every run of two or more \n characters.
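A rough sketch of this idea in Perl, with the assumptions spelled out: the file is taken to be sorted numerically by the first and then the second field, a line is dropped only when its reversed pair appears earlier in the file, and surviving lines are written to a new file rather than deleted in place (all names below are placeholders):

 #!/usr/bin/perl
 # binsearch_dedup.pl (placeholder name) -- for every pair whose first field
 # is larger, binary-search the sorted file on disk for the reversed pair;
 # if the reversed pair exists, the current line is a duplicate and is dropped.
 use strict;
 use warnings;
 my $file = shift or die "usage: $0 sorted_pairs_file > result\n";
 open my $scan,   '<', $file or die "$file: $!";   # sequential pass
 open my $search, '<', $file or die "$file: $!";   # handle used only for seeking
 my $size = -s $file;
 while ( my $line = <$scan> ) {
     my ( $x, $y ) = split ' ', $line;
     next if $x > $y && pair_exists( $search, $size, $y, $x );   # reverse duplicate
     print $line;
 }
 # Binary search over byte offsets: true if a line "$x $y" exists in the file.
 sub pair_exists {
     my ( $fh, $size, $x, $y ) = @_;
     my ( $lo, $hi ) = ( 0, $size );
     while ( $lo < $hi ) {
         my $mid = int( ( $lo + $hi ) / 2 );
         if ( $mid == 0 ) {
             seek $fh, 0, 0;                 # probe starts at the first line
         }
         else {
             seek $fh, $mid - 1, 0;
             <$fh>;                          # discard the partial line we landed in
         }
         my $probe = <$fh>;
         if ( !defined $probe ) {            # ran off the end of the file
             $hi = $mid;
             next;
         }
         my ( $u, $v ) = split ' ', $probe;
         my $cmp = $u <=> $x || $v <=> $y;
         return 1 if $cmp == 0;
         if   ( $cmp < 0 ) { $lo = tell $fh; }   # probe line is before the target
         else              { $hi = $mid; }       # probe line is after the target
     }
     return 0;
 }

This keeps memory usage constant, at the cost of one binary search (a handful of disk seeks) per input line.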

+2

Not quite sure whether this works, or how good it is ...

 awk '{ if ($2 > $1) print; else print $2, $1 }' hugetext | sort -k1,1n -k2,2n -u -o hugetext
+2

Do you want to remove duplicates, treating 1 2 and 2 1 as the same pair?

 < file.in perl -lane'print "@F[ $F[0] < $F[1] ? (0,1,0,1) : (1,0,0,1) ]"' \
 | sort -n \
 | perl -lane'$t="@F[0,1]"; print "@F[2,3]" if $t ne $p; $p=$t;' \
 > file.out

This can handle arbitrarily large files.

+1
 perl -lane '
     $key = $F[0] < $F[1] ? "$F[0] $F[1]" : "$F[1] $F[0]";
     $h{$key} = "";
     END {
         print for sort { my ($ax, $ay) = split " ", $a; my ($bx, $by) = split " ", $b; $ax <=> $bx || $ay <=> $by } keys %h;
     }
 ' file.txt

Explanation:

  • I put the two values of the current line in numerical order (the smaller one first)
  • I build the hash key $key by joining the first and second values with a space
  • I set $h{$key} to an empty string
  • At the end, I print all the keys, sorted in numerical order

Hash keys are unique by nature, so each pair is kept only once.

You just need to use Unix redirects to create a new file.
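For example, if the Perl code between the quotes above is saved in a file called dedup.pl (a placeholder name), the result can be redirected like this:

 perl -lan dedup.pl file.txt > file.out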

0

Here's a general O(n) algorithm that does this in a single pass (no repeated scans or sorting required):

  • Start with an empty hashset as your blacklist (a set is a map with just keys)
  • Read the file one line at a time.
  • For each line:
    • Check whether this pair is already in your blacklist.
    • If so, ignore it.
    • If not, add it to your result file, and add the swapped pair to the blacklist (for example, if you just read "3 4", add "4 3" to the blacklist).

This requires O(n) time and O(n) space for the blacklist. (There is no additional storage for the result if you open the file read/write and delete lines in place as you check them against the blacklist.)
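A minimal sketch of this one-pass approach in Perl (all names are placeholders); unlike the in-place variant mentioned in parentheses above, it simply writes the surviving lines to standard output, and the blacklist hash still takes memory proportional to the number of unique pairs:

 #!/usr/bin/perl
 # onepass.pl (placeholder name) -- single pass over the file with a hash
 # used as the blacklist set of reversed pairs.
 use strict;
 use warnings;
 my %blacklist;                            # reversed pairs to skip later
 while ( my $line = <STDIN> ) {
     my ( $x, $y ) = split ' ', $line;
     next if exists $blacklist{"$x $y"};   # its reverse was already written
     print "$x $y\n";                      # keep this pair in the result
     $blacklist{"$y $x"} = 1;              # blacklist the reversed pair
 }
 # Usage (placeholder names):  perl onepass.pl < hugetext > result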

0

Source: https://habr.com/ru/post/1440655/

