Problem: given a set of 250,000 integer user IDs and about one terabyte of text with one JSON record per line, load the records whose user ID matches an ID in the database.
Only about 1% of all records will match one of the 250,000 user IDs. Rather than paying the cost of JSON-decoding every record, I try to use string matching to determine whether a user ID appears in the raw JSON; only if there is a match do I decode the JSON, verify the decoded user ID, and insert the record.
The problem is that matching one line of raw JSON against a set of ~250k strings is slow.
Here is the code:
# get the list of integer user IDs
cur.execute('select distinct user_id from users')

# load them as text into a set
users = set()
for result in cur.fetchall():
    users.add(str(result[0]))

# start working on f, the one-json-record-per-line text file
for line in f:
    scanned += 1
    if any(user in line for user in users):
        print "got one!"
        # decode json
        # check for correct decoded user ID match
        # do insert
Am I approaching this the right way? What is a faster way to match these strings? With the full list of user IDs it currently processes only ~2 entries per second on a 3 GHz machine (not good). When the list of user IDs is very short, it manages ~200,000 entries per second.
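For illustration, this is the kind of alternative I am wondering about (a rough, untested sketch that reuses users, f and scanned from above; the names DIGITS and candidate are just placeholders): instead of scanning each line once per user ID, pull out every run of digits with a regex and test each token against the set, so the per-line cost depends on the number of tokens in the line rather than on the 250k IDs. It assumes the user ID appears in the raw JSON as a plain run of digits.

import re

# assumption: the user ID shows up in the raw line as a standalone run of digits
DIGITS = re.compile(r'\d+')

def candidate(line, users):
    # set membership is O(1), so this does not get slower as the set of IDs grows
    return any(token in users for token in DIGITS.findall(line))

for line in f:
    scanned += 1
    if candidate(line, users):
        print "got one!"
        # decode json
        # check for correct decoded user ID match
        # do insert

Would something along these lines be the right direction, or is there a better-known technique for this kind of many-pattern filtering?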