Search for duplicate records in a large text file

I am on a Linux machine (Red Hat) and I have an 11 GB text file. Each line of the file contains the data for one record, and the first n characters of the line are a unique record identifier. The file contains just over 27 million records.

I need to verify that there are no multiple entries in the file with the same unique identifier. I will also need to run the same check on an 80 GB text file, so any solution that requires loading the entire file into memory is not practical.

+4
7 answers

Read the file line by line, so you do not need to load it all into memory.

For each line (record), compute its sha256 hash (32 bytes), unless your identifier is already shorter than that and can be stored directly.

Store the hashes / ids in a numpy.array . This is probably the most compact way to keep them: 27 million entries times 32 bytes per hash is 864 MB, which should fit into the memory of a decent machine these days.
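
Below is a minimal sketch of this numpy variant (my own illustration, not code from the answer): collect the fixed-width digests into a pre-sized array, then sort it and compare neighbouring entries to count duplicate keys. The file name, the expected_records estimate and the sort-then-compare step are assumptions.

    # Hedged sketch: one 32-byte sha256 digest per record in a numpy array,
    # then an in-place sort and an adjacent-element comparison to count duplicates.
    import hashlib
    import numpy as np

    def count_duplicate_keys(path, n, expected_records):
        digests = np.empty(expected_records, dtype='S32')  # ~864 MB for 27e6 records
        count = 0
        with open(path, 'rb') as f:
            for line in f:
                digests[count] = hashlib.sha256(line[:n]).digest()  # hash just the key part
                count += 1
        digests = digests[:count]   # trim if the estimate was generous
        digests.sort()              # in-place sort, no second copy of the array
        return int(np.count_nonzero(digests[1:] == digests[:-1]))

    # e.g. count_duplicate_keys("bigdata.txt", n=20, expected_records=27_500_000)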

To speed up lookups, you can use, for example, the first 2 bytes of the hash as the key of a collections.defaultdict and put the remaining hash bytes into the list stored as its value. This effectively creates a hash table with 65536 buckets. For 27e6 records, each bucket will hold a list of roughly 400 entries on average. This means faster lookups than a numpy array, but it will use more memory.

    import collections
    import hashlib

    d = collections.defaultdict(list)
    with open('bigdata.txt', 'rb') as datafile:
        for line in datafile:
            id = hashlib.sha256(line).digest()  # Or id = line[:n]
            k = id[0:2]
            v = id[2:]
            if v in d[k]:
                print("duplicate found:", id)
            else:
                d[k].append(v)
+3

Right tool for the job: put your records into a database. If you do not already have Postgres or MySQL installed, I would go with sqlite.

    $ sqlite3 uniqueness.sqlite
    create table chk (
        ident char(n),    -- n as in first n characters
        lineno integer    -- for convenience
    );
    ^D

Then I would insert the unique identifier and line number into this table, possibly using a Python script like this:

    import sqlite3  # sqlite3 ships with the Python standard library

    n = ...  # how many chars are in the key part
    lineno = 0
    conn = sqlite3.connect("uniqueness.sqlite")
    cur = conn.cursor()
    with open("giant-file") as infile:
        for line in infile:
            lineno += 1
            ident = line[:n]
            cur.execute("insert into chk(ident, lineno) values(?, ?)",
                        [ident, lineno])
    conn.commit()  # commit before closing, or the inserts are rolled back
    cur.close()
    conn.close()

After that, you can index the table and use SQL:

    $ sqlite3 uniqueness.sqlite
    create index x_ident on chk(ident);  -- may take a bit of time

    -- quickly find duplicates, if any
    select ident, count(ident) as how_many
      from chk
     group by ident
    having count(ident) > 1;

    -- find the line numbers of specific violations, if needed
    select lineno from chk where ident = ...; -- insert a duplicate ident

Yes, I tried most of this code, it should work :)

+2

I would never recommend trying to filter such a massive text file in Python. No matter how you handle it, you will need to go through some careful steps to make sure you do not run out of memory.

The first thing that comes to mind is to compute a hash of each line and then use the hash to find duplicates. Since you store the line number as well, you can then compare the text directly to make sure there are no hash collisions.
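
A rough sketch of that idea, assuming Python and the first-n-characters key from the question (the function name find_duplicates and the use of byte offsets instead of line numbers are my own choices, so the earlier line can be re-read with a seek):

    # Hedged sketch of the hash-plus-position idea; not the answerer's exact method.
    import hashlib

    def find_duplicates(path, n):
        seen = {}  # digest of the key -> byte offset of the first line that used it
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                key = line[:n]
                digest = hashlib.sha256(key).digest()
                if digest in seen:
                    # Possible duplicate: re-read the earlier line and compare the raw
                    # keys so that a hash collision is not reported as a duplicate.
                    resume = f.tell()
                    f.seek(seen[digest])
                    earlier_key = f.readline()[:n]
                    f.seek(resume)
                    if earlier_key == key:
                        print("duplicate key", key, "at byte offset", offset)
                else:
                    seen[digest] = offset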

But the easiest solution would be to convert the text file into a database, which lets you quickly sort, search for and filter out duplicate elements. You can then re-create the text file afterwards if that is really necessary.

0

Read large text files in Python, line by line, without loading them into memory

The answer to this question was as follows:

    with open("log.txt") as infile:
        for line in infile:
            do_something_with(line)

Perhaps this will help you somehow, good luck.

0

Assuming I can't use a database, I would try something like this:

    # read the file one line at a time, http://stackoverflow.com/a/6475407/322909,
    # be sure to read the comments
    keys = set()
    with open("bigfile.txt") as f:
        for line in f:
            key = get_key(line)  # e.g. the first n characters of the line
            if key in keys:
                print("dup")
            else:
                keys.add(key)
0

Try the following:

    n=unique identifier size
    cat 11gb_file | cut -c-$n | sort | uniq -cd

This will output any duplicate identifiers and how many times they appear.

0

I have not tried this on a file that big, but... assuming that the fixed-width identifier is the first 7 characters and that lines are no longer than 999 + 7 characters, this should do the job:

  awk 'BEGIN{FIELDWIDTHS="7 999"} ! a[$1]++' file > newfile 
0

Source: https://habr.com/ru/post/1478870/

