Rigth job tool: put your records in the database. If you do not already have Postgres or MySQL installed, I would take sqlite.
$ sqlite3 uniqueness.sqlite create table chk ( ident char(n), -- n as in first n characters lineno integer -- for convenience ); ^D
Then I would insert a unique identifier and row number into this table, possibly using a Python script as follows:
import sqlite3 # install pysqlite3 before this n = ... # how many chars are in the key part lineno = 0 conn = sqlite3.connect("uniqueness.sqlite") cur = conn.cursor() with open("giant-file") as input: for line in input: lineno +=1 ident = line[:n] cur.execute("insert into chk(ident, lineno) values(?, ?)", [ident, lineno]) cur.close() conn.close()
After that, you can index the table and use SQL:
$ sqlite3 uniqueness.sqlite create index x_ident on chk(ident); -- may take a bit of time -- quickly find duplicates, if any select ident, count(ident) as how_many from chk group by ident having count(ident) > 1; -- find lines of specific violations, if needed select lineno from chk where ident = ...; -- insert a duplicate ident
Yes, I tried most of this code, it should work :)
9000 source share