Optimizing partial-match lookups on dictionary keys

I have a dictionary keyed by 4-tuples. I need to find all keys in the dictionary that partially match another tuple. I have code that does this, but it is slow and needs optimizing.

This is what I need:

Keys:

    (1, 2, 3, 4)
    (1, 3, 5, 2)
    (2, 4, 8, 7)
    (1, 4, 3, 4)

Match:

    (1, None, 3, None)

Result:

    [(1, 2, 3, 4), (1, 4, 3, 4)]

Current code:

    def GetTuples(self, keyWords):
        tuples = []
        for k in self.chain.iterkeys():
            match = True
            for i in range(self.order):
                if keyWords[i] is not None and keyWords[i] != k[i]:
                    match = False
                    break
            if match is True:
                tuples.append(k)
        return tuples
  • keyWords is a list containing the values I want to match on
  • self.chain is the dictionary
  • self.order is the tuple size
  • len(keyWords) is always equal to len(k)
  • None is treated as a wildcard
  • The dictionary is quite large (this method takes ~800 ms to run, and the dictionary occupies about 300 MB), so space is also a consideration.

Basically, I'm looking for either an optimization of this method, or a better way to store this data.

+6
4 answers

How about using a database?

I prefer SQLite + SQLAlchemy even for simple projects, but plain sqlite3 may have a gentler learning curve.

Putting an index on each column should take care of the speed issues.
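A minimal sketch of how that could look with the standard-library sqlite3 module; the chain table, its column names, and the get_tuples helper are invented here for illustration:

    import sqlite3

    # In-memory database for illustration; pass a file path to persist it.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE chain (k1 INTEGER, k2 INTEGER, k3 INTEGER, k4 INTEGER)")

    # One index per column, so any partially specified lookup can use one.
    for col in ("k1", "k2", "k3", "k4"):
        db.execute("CREATE INDEX idx_%s ON chain(%s)" % (col, col))

    def get_tuples(db, key_words):
        # Build a WHERE clause from the non-None positions only.
        conds, args = [], []
        for i, v in enumerate(key_words):
            if v is not None:
                conds.append("k%d = ?" % (i + 1))
                args.append(v)
        sql = "SELECT k1, k2, k3, k4 FROM chain"
        if conds:
            sql += " WHERE " + " AND ".join(conds)
        return [tuple(row) for row in db.execute(sql, args)]

With that, get_tuples(db, (1, None, 3, None)) returns the matching rows as tuples, and SQLite's planner can pick an index for whichever constrained column looks most selective.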

+4

You can probably speed it up by keeping indices for your keys. Essentially, something like this:

 self.indices[2][5] 

would contain the set of all keys that have 5 in the third position of the key.
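A sketch of how such indices might be built up front; BuildIndices is a made-up name, while self.chain and self.order are the attributes from the question:

    from collections import defaultdict

    def BuildIndices(self):
        # indices[i][v] is the set of all keys whose i-th element equals v.
        self.indices = [defaultdict(set) for _ in range(self.order)]
        for key in self.chain:
            for i, val in enumerate(key):
                self.indices[i][val].add(key)

Each key is referenced (not copied) once per position, so the extra memory is mostly per-set bookkeeping rather than duplicated tuples.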

Then you can simply intersect the relevant index sets to get the matching keys:

    matching_keys = None
    for i in range(self.order):
        if keyWords[i] is not None:
            if matching_keys is None:
                # Copy the first index set so that &= below does not
                # mutate the set stored in self.indices.
                matching_keys = set(self.indices[i][keyWords[i]])
            else:
                matching_keys &= self.indices[i][keyWords[i]]
    matching_keys = list(matching_keys) if matching_keys else []
+4

You cannot optimize this much further while the data lives in a plain dictionary: a dictionary offers nothing faster than a sequential pass over all of its elements in some unpredictable order, so your solution can never beat O(n).

Now, about databases. A database is not a universal solution to every (moderately complicated) problem. Can you reliably estimate the speed/complexity of such lookups in a database? If you scroll to the bottom of this answer, you'll see that on large data a database can perform considerably worse than a smart data structure.

What you need here is a hand-crafted data structure. There are many options, and the best one depends heavily on what else you do with this data. For example: you can keep N sorted lists of your keys, where the n-th list is sorted by the n-th tuple element (see the sketch below). For each constrained position you can then quickly select the run of keys matching that single element, and intersect the runs to get the result. This gives an average performance of O(log n) * O(m), where m is the average number of elements in one subset.
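A rough sketch of this approach; build_views, keys_with_value, and get_tuples are names of my own, not from the question:

    import bisect
    from operator import itemgetter

    def build_views(keys, order):
        # For each position n, keep the keys sorted by their n-th element,
        # plus a parallel list of just those elements for bisecting.
        views = []
        for n in range(order):
            by_n = sorted(keys, key=itemgetter(n))
            views.append((by_n, [k[n] for k in by_n]))
        return views

    def keys_with_value(views, n, value):
        # Binary search for the contiguous run of keys whose n-th element
        # equals value: two O(log n) bisections plus the slice itself.
        by_n, vals = views[n]
        lo = bisect.bisect_left(vals, value)
        hi = bisect.bisect_right(vals, value)
        return set(by_n[lo:hi])

    def get_tuples(views, key_words):
        result = None
        for n, v in enumerate(key_words):
            if v is not None:
                subset = keys_with_value(views, n, v)
                result = subset if result is None else result & subset
        return list(result) if result is not None else []

The views would be built once, e.g. build_views(list(self.chain), self.order). Note that an all-None query returns [] here, so a real implementation would special-case full wildcards.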

Alternatively, you can store the items in a k-d tree: that means paying an O(log n) insertion price, but queries like the one above also run in O(log n). Here is a Python example using the k-d tree implementation from SciPy:

    from scipy.spatial import kdtree
    import itertools
    import random

    random.seed(1)
    data = list(itertools.permutations(range(10), 4))
    random.shuffle(data)
    data = data[:(len(data) / 2)]

    tree = kdtree.KDTree(data)

    def match(a, b):
        assert len(a) == len(b)
        for i, v in enumerate(a):
            if v != b[i] and (v is not None) and (b[i] is not None):
                return False
        return True

    def find_like(kdtree, needle):
        assert len(needle) == kdtree.m
        def do_find(tree, needle):
            if hasattr(tree, 'idx'):
                return list(itertools.ifilter(lambda x: match(needle, x),
                                              kdtree.data[tree.idx]))
            if needle[tree.split_dim] is None:
                return do_find(tree.less, needle) + do_find(tree.greater, needle)
            if needle[tree.split_dim] <= tree.split:
                return do_find(tree.less, needle)
            else:
                return do_find(tree.greater, needle)
        return do_find(kdtree.tree, needle)

    def find_like_bf(kdtree, needle):
        assert len(needle) == kdtree.m
        return list(itertools.ifilter(lambda x: match(needle, x), kdtree.data))

    import timeit
    print "kd tree:"
    print "%.2f sec" % timeit.timeit("find_like(tree, (1, None, 2, None))",
                                     "from __main__ import find_like, tree",
                                     number=1000)
    print "brute force:"
    print "%.2f sec" % timeit.timeit("find_like_bf(tree, (1, None, 2, None))",
                                     "from __main__ import find_like_bf, tree",
                                     number=1000)

And the results of the test run:

    $ python lookup.py
    kd tree:
    0.89 sec
    brute force:
    6.92 sec

Just for fun, I also benchmarked a naive database solution. The initialization code above changes to:

    random.seed(1)
    data = list(itertools.permutations(range(30), 4))
    random.shuffle(data)

And the "database" implementation itself:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE a (x1 INTEGER, x2 INTEGER, x3 INTEGER, x4 INTEGER)")
    db.execute("CREATE INDEX x1 ON a(x1)")
    db.execute("CREATE INDEX x2 ON a(x2)")
    db.execute("CREATE INDEX x3 ON a(x3)")
    db.execute("CREATE INDEX x4 ON a(x4)")
    db.executemany("INSERT INTO a VALUES (?, ?, ?, ?)",
                   [[int(x) for x in value] for value in tree.data])

    def db_test():
        cur = db.cursor()
        cur.execute("SELECT * FROM a WHERE x1=? AND x3=?", (1, 2))
        return cur.fetchall()

    print "sqlite db:"
    print "%.2f sec" % timeit.timeit("db_test()",
                                     "from __main__ import db_test",
                                     number=100)

And the test results, with the run count reduced to 100 per benchmark (for a set of 657,720 keys):

    $ python lookup.py
    building tree done in 6.97 sec
    building db done in 11.59 sec
    kd tree: 1.90 sec
    sqlite db: 2.31 sec

It is also worth mentioning that building the tree took almost half as long as inserting the same test data set into the database.

Full source here: https://gist.github.com/1261449

+4

Riffing on Amber's answer:

    >>> from collections import defaultdict
    >>> index = defaultdict(lambda: defaultdict(set))
    >>> keys = [(1, 2, 3, 4),
    ...         (1, 3, 5, 2),
    ...         (2, 4, 8, 7),
    ...         (1, 4, 3, 4),
    ...         ]
    >>> for key in keys:
    ...     for i, val in enumerate(key):
    ...         index[i][val].add(key)
    ...
    >>> def match(goal):
    ...     res = []
    ...     for i, val in enumerate(goal):
    ...         if val is not None:
    ...             res.append(index[i][val])
    ...     return reduce(set.intersection, res)
    ...
    >>> match((1, None, 3, None))
    set([(1, 4, 3, 4), (1, 2, 3, 4)])
+2
