Searching text for a long list of substrings

Problem: given a set of about 250,000 integer user IDs and roughly one terabyte of single-line, JSON-formatted records, load the records whose user ID appears in that set.

Only about 1% of all records will match one of the 250,000 user IDs. Rather than slowly decoding every record, I try to use string matching to decide whether a user ID occurs in the raw JSON; only on a match is the JSON decoded, the record checked, and then inserted.

The problem is that matching one line of raw JSON against a set of ~250k string entries is slow.

Here is the code:

# get the list of integer user IDs
cur.execute('select distinct user_id from users')

# load them as text into a set
users = set()
for result in cur.fetchall():
    users.add(str(result[0]))

# start working on f, the one-json-record-per-line text file
for line in f:
    scanned += 1
    if any(user in line for user in users):
        print "got one!"
        # decode json
        # check for correct decoded user ID match
        # do insert

Am I going about this the right way? What is a faster way to match these strings? Currently, with this many user IDs to look for, it manages ~2 entries per second on a 3 GHz machine (not very good). With a very short list of user IDs it manages ~200,000 entries per second.

3 answers

Aho-Corasick seems tailor-made for this purpose. There is even a handy Python module for it (easy_install ahocorasick).

import ahocorasick

# build a match structure
print 'init empty tree'
tree = ahocorasick.KeywordTree()

cur.execute('select distinct user_id from users')
print 'add usernames to tree'
for result in cur.fetchall():
    tree.add(str(result[0]))

print 'build fsa'
tree.make()

for line in f:
    scanned += 1
    if tree.search(line) is not None:
        print "got one!"

This comes out to ~450 records per second.
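Note that the KeywordTree API above belongs to the older ahocorasick package. If you instead install the newer pyahocorasick package (also imported as ahocorasick), the same idea would look roughly like this sketch, assuming its Automaton API:

import ahocorasick  # pip install pyahocorasick

# build the automaton once from all user IDs
automaton = ahocorasick.Automaton()
cur.execute('select distinct user_id from users')
for result in cur.fetchall():
    user_id = str(result[0])
    automaton.add_word(user_id, user_id)
automaton.make_automaton()

for line in f:
    scanned += 1
    # iter() yields (end_index, value) for every keyword occurrence
    if next(automaton.iter(line), None) is not None:
        print "got one!"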


Try inverting the matching: instead of searching the line for each of 250k user IDs, pull the digit sequences out of the line and look them up in the set:

import re

for digit_sequence in re.findall('[0-9]+', line):
    if digit_sequence in users:
        ...
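Folded into the asker's loop, a sketch could look like the following (users and f as in the question; json.loads stands in for the actual decoding step, and 'user_id' is a guess at the field name):

import json
import re

digits = re.compile('[0-9]+')

for line in f:
    scanned += 1
    # set membership is O(1) per candidate, regardless of how many IDs there are
    if any(seq in users for seq in digits.findall(line)):
        record = json.loads(line)
        if str(record['user_id']) in users:  # 'user_id' is a hypothetical field name
            # do insert
            pass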

I am a C++ freelancer, and my clients are usually startups that have some slow Python/Java/.NET/etc. code and want it to run faster. Usually I can make it 100 times faster. I recently had a similar task: searching for 5 million substrings in terabytes of text data.

I tried a few algorithms. For Aho-Corasick I used the open-source http://sourceforge.net/projects/multifast/ . It was not the fastest. The fastest was my own algorithm, which I put together from a hash table plus some ideas taken from the Rabin-Karp search algorithm. This simple algorithm was 5 times faster and used 5 times less memory than AC. The average substring length was 32 bytes. So AC may not be the fastest for this task.
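The answer does not include code, but the general shape of a hash-table-based multi-pattern search in the Rabin-Karp spirit is roughly this (my own illustration in Python, not the answerer's C++ implementation): index each pattern under a hash of its first few characters, slide a window of that length over the text, and verify full patterns only on a hash hit.

# Illustrative sketch only, not the answerer's actual implementation.

def build_index(patterns):
    # index every pattern under the hash of its first `window` characters
    window = min(len(p) for p in patterns)
    index = {}
    for p in patterns:
        index.setdefault(hash(p[:window]), []).append(p)
    return window, index

def find_any(text, window, index):
    # slide a window over the text; on a hash hit, verify the candidates.
    # A real Rabin-Karp implementation would update the window hash
    # incrementally instead of rehashing each slice.
    for i in range(len(text) - window + 1):
        for p in index.get(hash(text[i:i + window]), ()):
            if text.startswith(p, i):
                return p
    return None

In the asker's setting this would be built once with window, index = build_index(users), then each line checked with find_any(line, window, index).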


Source: https://habr.com/ru/post/1447020/

