With data of this size I would use a proper database. Databases are optimized for processing large datasets and will do it much faster than any Python program you could write.
Comparing the raw strings is expensive. Hash each string instead, so that the full index over the hashes has a very good chance of fitting in memory. md5 is 128 bits and is very fast to compute.
First, compute the md5 of every record in each file and save it to another text file:
    from hashlib import md5

    with open('hashfile.txt') as input:
        with open('hashfile-md5.txt', 'w') as output:
            for line in input:
                value = line.rstrip()
                # write "text <TAB> md5 hex digest", matching the table layout below
                output.write(value + '\t' + md5(value.encode()).hexdigest() + '\n')
Repeat the same for address.txt, creating address-md5.txt.
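Since the same transformation runs twice, it can be wrapped in a small helper, something like this sketch (hash_file is just an illustrative name; the file names are the ones used above):

    from hashlib import md5

    def hash_file(src, dst):
        # Write "text <TAB> md5 hex digest" for every line of src into dst.
        with open(src) as input_file, open(dst, 'w') as output_file:
            for line in input_file:
                value = line.rstrip('\n')
                output_file.write(value + '\t' + md5(value.encode()).hexdigest() + '\n')

    hash_file('hashfile.txt', 'hashfile-md5.txt')
    hash_file('address.txt', 'address-md5.txt')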
Take PostgreSQL, MySQL, or even SQLite (I will use it here) and create two tables and one index.
    $ sqlite3 matching-db.sqlite

    create table hashfile (
        txt varchar(64),
        hash varchar(32)
    );
    create table address (
        txt varchar(64),
        hash varchar(32)
    );
Now let's load our data. Native database imports are usually much faster than inserting from Python via the dbapi.
    .separator \t
    .import hashfile-md5.txt hashfile
    .import address-md5.txt address
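For comparison, the dbapi route mentioned above would look roughly like this sketch (using Python's built-in sqlite3 module; it works, it is just usually slower than the native .import):

    import sqlite3

    # Loading via the dbapi instead of .import: it works, but is usually slower.
    conn = sqlite3.connect('matching-db.sqlite')
    with open('hashfile-md5.txt') as f:
        rows = (line.rstrip('\n').split('\t') for line in f)
        conn.executemany('insert into hashfile (txt, hash) values (?, ?)', rows)
    conn.commit()
    conn.close()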
Now we can create an index:
    create index x_address_hash on address(hash);
Here is a select statement that will efficiently scan the large hashfile table and look up matching hashes from the small address table. The index will (hopefully) stay in RAM the whole time, as will most of the address table.
    select h.txt from hashfile h, address a where h.hash = a.hash and h.txt = a.txt;
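If the matching lines need to end up back in a text file, the select can also be driven from Python, roughly like this (a sketch; matches.txt is just an example output name):

    import sqlite3

    conn = sqlite3.connect('matching-db.sqlite')
    query = ('select h.txt from hashfile h, address a '
             'where h.hash = a.hash and h.txt = a.txt')
    with open('matches.txt', 'w') as out:
        # iterate over the cursor so the whole result set is never held in memory
        for (txt,) in conn.execute(query):
            out.write(txt + '\n')
    conn.close()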
The idea is that the x_address_hash index will be used to match the hashes efficiently, and only when the hashes match are the actual text values compared.
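If you want to check that SQLite actually picks the index, you can prefix the query with explain query plan in the sqlite3 shell; the output format varies between versions, but it should show a search on address using x_address_hash:

    explain query plan
    select h.txt from hashfile h, address a
    where h.hash = a.hash and h.txt = a.txt;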
I have not tried this on 29 MB of data, but on a toy example with two lines it worked :)