Searching for strings in massive files efficiently

I have found variants of this idea, but none of them got me (I am very new to Python) to where I need to be.

Here is the scenario:

  • I have one massive, 27 GB hashfile.txt, consisting of unique strings, one per line.
  • I need to parse it line by line, looking for a match in another, much smaller (~800 MB) addresses.txt file.
  • When a match is found, it must be written to outfile.txt.

My current code has been optimized to the best of my ability, but it only processes about 150 lines per second. Given that I have over 1.5 billion lines in my hashfile.txt, any optimization will help.

    fin = 'hashed.txt'
    nonzeros = open('addrOnly.txt', 'r')
    fout = open('hits.txt', 'w')
    lines = nonzeros.read()
    i = 0
    count = 0
    with open(fin, 'r') as f:
        for privkey in f:
            address = privkey.split(", ")[0]
            if address in lines:
                fout.write(privkey)
            i = i + 1
            if i % 100 == 0:
                count = count + 100
                print "Passed: " + str(count)
2 answers

What you probably want to implement is a Rabin-Karp string search. It is very efficient at searching for multiple strings simultaneously in a single pass.

There is more information on Python implementations in this post: efficient substring python

Since you are looking for multiple addresses at the same time, you probably want to hash the entries in addresses.txt and compare them against the Rabin-Karp rolling hash on each iteration. Read up on the rolling hash in Rabin-Karp and you will see how it works.

Since Rabin-Karp requires all patterns to be the same length, and in practice the addresses probably all have some reasonable length, you can truncate them all to the same (not too short) length and hash just that prefix. You can also modify the Rabin-Karp hash to be invariant to whitespace and to small differences in how the addresses are formatted, and define a custom string comparison in the same spirit to confirm the matches.
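Here is a rough sketch of the multi-pattern idea in Python (not code from this answer): addresses are read from addrOnly.txt and truncated to a common prefix length M for hashing, a Rabin-Karp rolling hash slides over each line of hashed.txt, and a hash hit is confirmed with an exact comparison. The file names match the question's code; M, BASE and MOD are illustrative choices and should be adjusted to your data.

    # Multi-pattern Rabin-Karp sketch: hash table of address-prefix hashes,
    # rolling hash over each line, exact check on a candidate hit.
    BASE, MOD = 256, (1 << 61) - 1
    M = 20  # common prefix length for all patterns (assumption, tune it)

    def rk_hash(s):
        h = 0
        for ch in s:
            h = (h * BASE + ord(ch)) % MOD
        return h

    # hash of an M-char prefix -> set of full addresses with that prefix hash
    patterns = {}
    with open('addrOnly.txt') as f:
        for line in f:
            addr = line.strip()
            if len(addr) >= M:  # addresses shorter than M are skipped here
                patterns.setdefault(rk_hash(addr[:M]), set()).add(addr)

    high = pow(BASE, M - 1, MOD)  # BASE^(M-1) mod MOD, used to drop the leading char

    with open('hashed.txt') as fin, open('hits.txt', 'w') as fout:
        for line in fin:
            text = line.rstrip('\n')
            if len(text) < M:
                continue
            h = rk_hash(text[:M])  # hash of the first M-char window
            for i in range(len(text) - M + 1):
                # A matching hash is only a candidate; confirm with a real comparison.
                if h in patterns and any(text.startswith(a, i) for a in patterns[h]):
                    fout.write(line)
                    break
                if i + M < len(text):
                    # Rolling update: remove text[i], append text[i + M].
                    h = ((h - ord(text[i]) * high) * BASE + ord(text[i + M])) % MOD

The point of the rolling update is that each new window's hash is computed in constant time from the previous one, so checking a line against all patterns costs roughly one pass over the line rather than one pass per pattern.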


With data of this size, I would use a proper database. Databases are optimized to process large datasets quickly, far better than any Python program you could write by hand.

Comparing raw strings is expensive. Let's hash the strings instead, so that a full binary tree index over the hashes has a very good chance of fitting in memory. md5 is 128 bits and very fast to compute.

First, compute the md5 of each record in each file and save it to another text file:

    from hashlib import md5

    with open('hashfile.txt') as input:
        with open('hashfile-md5.txt', 'w') as output:
            for line in input:
                value = line.rstrip()  # cut '\n'
                output.write(value)
                output.write('\t')     # let our file be tab-separated
                output.write(str(int(md5(value).hexdigest(), 16)))  # md5 as a long number
                output.write('\n')

Repeat the same for address.txt, creating address-md5.txt.

Take PostgreSQL, MySQL, or even SQLite (I will use it here), and create two tables and one index.

    $ sqlite3 matching-db.sqlite

    create table hashfile (
        txt varchar(64),  -- adjust size to line lengths of hashfile.txt
        hash number(38)   -- enough to contain a 128-bit hash
    );

    create table address (
        txt varchar(64),  -- adjust size to line lengths of address.txt
        hash number(38)   -- enough to contain a 128-bit hash
    );

Now load the data. Native database imports are usually much faster than inserting from Python via the DB API.

    .separator \t
    .import hashfile-md5.txt hashfile
    .import address-md5.txt address

Now we can create an index:

 create index x_address_hash on address(hash); 

Here is a select statement that will efficiently scan the large hashfile table and look up matching hashes in the small address table. The index will (hopefully) stay in RAM the whole time, as will most of the address table.

 select h.txt from hashfile h, address a where h.hash = a.hash and h.txt = a.txt; 

The idea is that the x_address_hash index will be used to match hashes efficiently, and when the hashes match, the actual text values are compared.
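If you want the matching lines back in a plain text file (like the hits.txt from the question), one option is to run the select from Python with the standard sqlite3 module. A small sketch, assuming the database file is named matching-db.sqlite as above:

    import sqlite3

    # Assumes the database built above is 'matching-db.sqlite' and that the
    # matching lines should go to 'hits.txt', as in the question.
    conn = sqlite3.connect('matching-db.sqlite')
    query = ("select h.txt "
             "from hashfile h, address a "
             "where h.hash = a.hash and h.txt = a.txt")

    with open('hits.txt', 'w') as out:
        for (txt,) in conn.execute(query):
            out.write(txt + '\n')

    conn.close()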

I have not tried this on that much data, but it worked on a toy two-line example :)


Source: https://habr.com/ru/post/1469122/

