How can I efficiently filter a dictionary with arbitrary-length tuples as keys?

TL;DR

What is the most efficient way to implement a filter function for a dictionary with variable-size tuple keys? The filter is a tuple of the same size as the dictionary keys, and the function should return all keys in the dictionary that match the filter, where a key matches if filter[i] is None or filter[i] == key[i] for every dimension i.


In my current project, I need to process dictionaries with lots of data. The general structure of a dictionary is that it contains tuples of 2-4 integers as keys and integers as values. All keys in a given dictionary are the same size. To illustrate, here are examples of the dictionaries I need to process:

    {(1, 2): 1, (1, 5): 2}
    {(1, 5, 3): 2}
    {(5, 2, 5, 2): 8}

These dictionaries contain many entries, the largest having about 20,000 of them. I frequently have to filter these entries, often looking only at certain indices of the key tuples. Ideally, I want a function to which I can provide a filter tuple. The function should then return all keys that match the filter tuple. If the filter tuple contains a None entry, it acts as a wildcard and matches any value of a dictionary key at that index.
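For reference, a minimal sketch of the matching predicate I have in mind (the names are only illustrative):

    def matches(key: tuple, flt: tuple) -> bool:
        # None in the filter acts as a wildcard for that position
        return all(f is None or f == k for k, f in zip(key, flt))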

An example of what the function should do for a dictionary with two-dimensional keys:

    >>> d = {(1, 2): 1, (1, 5): 2, (2, 5): 1, (3, 9): 5}
    >>> my_filter_fn((1, None))
    {(1, 2), (1, 5)}
    >>> my_filter_fn((None, 5))
    {(1, 5), (2, 5)}
    >>> my_filter_fn((2, 4))
    set()
    >>> my_filter_fn((None, None))
    {(1, 2), (1, 5), (2, 5), (3, 9)}

Since my dictionaries have tuples of different sizes as keys, I tried to solve the problem by writing a generator expression that takes the tuple size into account:

    def my_filter_fn(entries: dict, match: tuple):
        return (x for x in entries.keys()
                if all(match[i] is None or match[i] == x[i]
                       for i in range(len(match))))

Unfortunately, this is rather slow compared to writing the condition out by hand ((match[0] is None or match[0] == x[0]) and (match[1] is None or match[1] == x[1])); for 4 dimensions it is about 10 times slower. This is a problem for me, since I need to perform this filtering quite often.

The following code demonstrates the performance issue. It is provided only to illustrate the problem and to let you reproduce the tests. You can skip the code; the results are below.

    import random
    import timeit


    def access_variable_length():
        for key in entry_keys:
            for k in (x for x in all_entries.keys()
                      if all(key[i] is None or key[i] == x[i]
                             for i in range(len(key)))):
                pass


    def access_static_length():
        for key in entry_keys:
            for k in (x for x in all_entries.keys()
                      if (key[0] is None or x[0] == key[0])
                      and (key[1] is None or x[1] == key[1])
                      and (key[2] is None or x[2] == key[2])
                      and (key[3] is None or x[3] == key[3])):
                pass


    def get_rand_or_none(start, stop):
        number = random.randint(start - 1, stop)
        if number == start - 1:
            number = None
        return number


    entry_keys = set()
    for h in range(100):
        entry_keys.add((get_rand_or_none(1, 200), get_rand_or_none(1, 10),
                        get_rand_or_none(1, 4), get_rand_or_none(1, 7)))

    all_entries = dict()
    for l in range(13000):
        all_entries[(random.randint(1, 200), random.randint(1, 10),
                     random.randint(1, 4), random.randint(1, 7))] = 1

    variable_time = timeit.timeit("access_variable_length()",
                                  "from __main__ import access_variable_length",
                                  number=10)
    static_time = timeit.timeit("access_static_length()",
                                "from __main__ import access_static_length",
                                number=10)

    print("variable length time: {}".format(variable_time))
    print("static length time: {}".format(static_time))

Results:

    variable length time: 9.625867042849316
    static length time: 1.043319165662158

I would like to avoid creating three different functions my_filter_fn2, my_filter_fn3 and my_filter_fn4 to cover all possible sizes of my dictionaries and then dispatching to static-length filtering. I know that variable-length filtering will always be slower than fixed-length filtering, but I had not expected it to be almost 10 times slower. Since I am not a Python expert, I hope there is a smart way to reformulate the variable-length generator expression that gives better performance.

What is the most efficient way to filter a huge dictionary, as I described?

4 answers

Thanks for the opportunity to think of tuples in sets and dictionaries. This is a very useful and powerful corner of Python.

Python is interpreted, so if you come from a compiled language, one good rule is to avoid complex nested iteration where you can. If you find yourself writing complex for loops or comprehensions, always consider whether there is a better way to do it.

List indexing ( stuff[i] ) and range(len(stuff)) are inefficient and verbose in Python, and rarely needed. It is more efficient (and more natural) to iterate:

    for item in stuff:
        do_something(item)

The following code runs quickly because it uses some of Python's strengths: comprehensions, dictionaries, sets, and tuple unpacking.

There are still iterations, but they are small and simple. There is only one if statement in the whole code, and it runs just once per filter position, i.e. at most four times per filter operation. That helps performance and makes the code easier to read.

Explanation of the method...

Each key from the source data:

 {(1, 4, 5): 1} 

is indexed by position and value:

    {
        (0, 1): (1, 4, 5),
        (1, 4): (1, 4, 5),
        (2, 5): (1, 4, 5),
    }

(Python indexes from zero.)

The indexes are merged into one big lookup dictionary whose values are sets of tuples:

    {
        (0, 1): {(1, 4, 5), (1, 6, 7), (1, 2), (1, 8), (1, 4, 2, 8), ...},
        (0, 2): {(2, 1), (2, 2), (2, 4, 1, 8), ...},
        (1, 4): {(1, 4, 5), (1, 4, 2, 8), (2, 4, 1, 8), ...},
        ...
    }

After this lookup is built (and it is built very efficiently), filtering is just set intersections and dictionary lookups, both of which are lightning fast. Filtering takes microseconds, even on a large dictionary.
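To illustrate the core idea with a minimal, self-contained sketch (the lookup contents here are made up): for a hypothetical filter (1, None, 3), only the non-None positions constrain the result, so we intersect the candidate sets stored under those (position, value) pairs.

    # keys with 1 at position 0, and keys with 3 at position 2
    lookup = {
        (0, 1): {(1, 4, 5), (1, 5, 3), (1, 2)},
        (2, 3): {(1, 5, 3)},
    }
    print(lookup[(0, 1)] & lookup[(2, 3)])  # {(1, 5, 3)}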

The method handles data with 2-, 3-, or 4-tuples (or any other arity), but arity_filtered() returns only keys with the same number of members as the filter tuple. The class therefore lets you filter all the data together or treat the different tuple sizes separately, with little difference in performance between the two.

Timing results on a large random data set (11,500 tuples) were 0.30 s to build the lookup and 0.007 s for 100 searches.

    from collections import defaultdict


    class TupleFilter:
        def __init__(self, data):
            self.data = data
            self.lookup = self.build_lookup()

        def build_lookup(self):
            lookup = defaultdict(set)
            for data_item in self.data:
                for member_ref, data_key in tuple_index(data_item).items():
                    lookup[member_ref].add(data_key)
            return lookup

        def filtered(self, tuple_filter):
            # initially unfiltered
            results = self.all_keys()
            # reduce the result set position by position
            for position, value in enumerate(tuple_filter):
                if value is not None:
                    match_or_empty_set = self.lookup.get((position, value), set())
                    results = results.intersection(match_or_empty_set)
            return results

        def arity_filtered(self, tuple_filter):
            tf_length = len(tuple_filter)
            return {match for match in self.filtered(tuple_filter)
                    if tf_length == len(match)}

        def all_keys(self):
            return set(self.data.keys())


    def tuple_index(item_key):
        member_refs = enumerate(item_key)
        return {(pos, val): item_key for pos, val in member_refs}


    data = {
        (1, 2): 1,
        (1, 5): 2,
        (1, 5, 3): 2,
        (5, 2, 5, 2): 8,
    }

    tests = {
        (1, 5): 2,
        (1, None, 3): 1,
        (1, None): 3,
        (None, 5): 2,
    }

    tf = TupleFilter(data)
    for filter_tuple, expected_length in tests.items():
        result = tf.filtered(filter_tuple)
        print("Filter {0} => {1}".format(filter_tuple, result))
        assert len(result) == expected_length

    # same-arity filtering
    filter_tuple = (1, None)
    print('Not arity matched: {0} => {1}'
          .format(filter_tuple, tf.filtered(filter_tuple)))
    print('Arity matched: {0} => {1}'
          .format(filter_tuple, tf.arity_filtered(filter_tuple)))

    # check that unfiltered results return the original key set
    assert tf.filtered((None, None)) == tf.all_keys()

Output:

    $ python filter.py
    Filter (1, 5) => {(1, 5), (1, 5, 3)}
    Filter (1, None, 3) => {(1, 5, 3)}
    Filter (1, None) => {(1, 2), (1, 5), (1, 5, 3)}
    Filter (None, 5) => {(1, 5), (1, 5, 3)}
    Not arity matched: (1, None) => {(1, 2), (1, 5), (1, 5, 3)}
    Arity matched: (1, None) => {(1, 2), (1, 5)}

I made some changes:

  • you don't need to use the dict.keys method to iterate over the keys: iterating over the dict object itself yields its keys,

  • the code is split into separate modules, which makes it easier to read and modify:

    • preparation.py with helpers for generating test data:

        import random

        left_ends = [200, 10, 4, 7]


        def generate_all_entries(count):
            return {tuple(random.randint(1, num) for num in left_ends): 1
                    for _ in range(count)}


        def generate_entry_keys(count):
            return [tuple(get_rand_or_none(1, num) for num in left_ends)
                    for _ in range(count)]


        def get_rand_or_none(start, stop):
            number = random.randint(start - 1, stop)
            if number == start - 1:
                number = None
            return number
    • functions.py for the functions under test,
    • main.py for tests.
  • passing arguments to the functions instead of reading them from the global scope, so the static- and variable-length versions become

      def access_static_length(all_entries, entry_keys):
          for key in entry_keys:
              for k in (x for x in all_entries
                        if (key[0] is None or x[0] == key[0])
                        and (key[1] is None or x[1] == key[1])
                        and (key[2] is None or x[2] == key[2])
                        and (key[3] is None or x[3] == key[3])):
                  pass


      def access_variable_length(all_entries, entry_keys):
          for key in entry_keys:
              for k in (x for x in all_entries
                        if all(key[i] is None or key[i] == x[i]
                               for i in range(len(key)))):
                  pass
  • using min on the results of timeit.repeat instead of timeit.timeit to get the most representative results (more on this in this answer); a short example follows this list,

  • varying the number of entry_keys elements from 10 to 100 (both ends included) in steps of 10,

  • varying the number of all_entries elements from 10000 to 15000 (both ends included) in steps of 500.
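As promised above, a short self-contained sketch of the min(timeit.repeat(...)) idiom (the timed statement is arbitrary):

    import timeit

    # repeat=10 performs 10 independent timing runs of `number` executions each;
    # min() picks the best run, since system noise can only make runs slower
    best = min(timeit.repeat("sum(range(1000))", repeat=10, number=1000))
    print(best)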


But back to the point.

Enhancements

  • We can improve filtering by skipping index checks for None values in keys:

      def access_variable_length_with_skipping_none(all_entries, entry_keys):
          for key in entry_keys:
              non_none_indexes = {i for i, value in enumerate(key)
                                  if value is not None}
              for k in (x for x in all_entries.keys()
                        if all(key[i] == x[i] for i in non_none_indexes)):
                  pass
  • The next suggestion is to use numpy :

      import numpy as np


      def access_variable_length_numpy(all_entries, entry_keys):
          keys_array = np.array(list(all_entries))
          for entry_key in entry_keys:
              non_none_indexes = [i for i, value in enumerate(entry_key)
                                  if value is not None]
              non_none_values = [value for i, value in enumerate(entry_key)
                                 if value is not None]
              # a key matches only if *all* constrained columns are equal
              mask = (keys_array[:, non_none_indexes] == non_none_values).all(axis=1)
              indexes = np.where(mask)[0]
              for k in map(tuple, keys_array[indexes]):
                  pass

Benchmarks

main.py content:

    import timeit
    from itertools import product

    number = 5
    repeat = 10

    for all_entries_count, entry_keys_count in product(range(10000, 15001, 500),
                                                       range(10, 101, 10)):
        print('all entries count: {}'.format(all_entries_count))
        print('entry keys count: {}'.format(entry_keys_count))

        preparation_part = (
            "from preparation import (generate_all_entries,\n"
            "                         generate_entry_keys)\n"
            "all_entries = generate_all_entries({all_entries_count})\n"
            "entry_keys = generate_entry_keys({entry_keys_count})\n"
            .format(all_entries_count=all_entries_count,
                    entry_keys_count=entry_keys_count))

        static_time = min(timeit.repeat(
            "access_static_length(all_entries, entry_keys)",
            preparation_part + "from functions import access_static_length",
            repeat=repeat,
            number=number))
        variable_time = min(timeit.repeat(
            "access_variable_length(all_entries, entry_keys)",
            preparation_part + "from functions import access_variable_length",
            repeat=repeat,
            number=number))
        variable_time_with_skipping_none = min(timeit.repeat(
            "access_variable_length_with_skipping_none(all_entries, entry_keys)",
            preparation_part +
            "from functions import access_variable_length_with_skipping_none",
            repeat=repeat,
            number=number))
        variable_time_numpy = min(timeit.repeat(
            "access_variable_length_numpy(all_entries, entry_keys)",
            preparation_part + "from functions import access_variable_length_numpy",
            repeat=repeat,
            number=number))

        print("static length time: {}".format(static_time))
        print("variable length time: {}".format(variable_time))
        print("variable length time with skipping `None` keys: {}"
              .format(variable_time_with_skipping_none))
        print("variable length time with numpy: {}"
              .format(variable_time_numpy))

which on my machine with Python 3.6.1 gives:

    all entries count: 10000
    entry keys count: 10
    static length time: 0.06314293399918824
    variable length time: 0.5234129569980723
    variable length time with skipping `None` keys: 0.2890012050011137
    variable length time with numpy: 0.22945181500108447
    all entries count: 10000
    entry keys count: 20
    static length time: 0.12795891799760284
    variable length time: 1.0610534609986644
    variable length time with skipping `None` keys: 0.5744297259989253
    variable length time with numpy: 0.5105678180007089
    all entries count: 10000
    entry keys count: 30
    static length time: 0.19210158399801003
    variable length time: 1.6491422000035527
    variable length time with skipping `None` keys: 0.8566724129996146
    variable length time with numpy: 0.7363859869983571
    all entries count: 10000
    entry keys count: 40
    static length time: 0.2561357790000329
    variable length time: 2.08878050599742
    variable length time with skipping `None` keys: 1.1256247100027394
    variable length time with numpy: 1.0066140279996034
    all entries count: 10000
    entry keys count: 50
    static length time: 0.32130833200062625
    variable length time: 2.6166040710013476
    variable length time with skipping `None` keys: 1.4147321179989376
    variable length time with numpy: 1.1700750320014777
    all entries count: 10000
    entry keys count: 60
    static length time: 0.38276188999952865
    variable length time: 3.153736616997776
    variable length time with skipping `None` keys: 1.7147898039984284
    variable length time with numpy: 1.4533947029995034
    all entries count: 10000
    entry keys count: 70
    ...
    all entries count: 15000
    entry keys count: 80
    static length time: 0.7141444490007416
    variable length time: 6.186657476999244
    variable length time with skipping `None` keys: 3.376506028998847
    variable length time with numpy: 3.1577993860009883
    all entries count: 15000
    entry keys count: 90
    static length time: 0.8115685330012639
    variable length time: 7.14327938399947
    variable length time with skipping `None` keys: 3.7462387939995097
    variable length time with numpy: 3.6140603050007485
    all entries count: 15000
    entry keys count: 100
    static length time: 0.8950150890013902
    variable length time: 7.829741768000531
    variable length time with skipping `None` keys: 4.1662235900003
    variable length time with numpy: 3.914334102999419

Summary

As we can see, the numpy version is not as fast as expected, and that is apparently not numpy's fault.

If we remove the conversion of the filtered array rows to tuple with map and just leave

    for k in keys_array[indexes]:
        ...

then it becomes very fast (faster than the static-length version), so the bottleneck is the conversion from numpy.ndarray rows to tuple.

Skipping the None entries of the filter keys gives us about a 50% speed increase, so feel free to add it.
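As a further sketch (my own suggestion, not measured in the benchmarks above): converting the selected rows with tolist() first is usually cheaper than applying tuple to ndarray rows directly, because the conversion to Python objects happens in a single C-level call:

    import numpy as np

    keys_array = np.array([(1, 2, 3, 4), (5, 6, 7, 8)])
    indexes = np.array([0, 1])

    # tolist() yields nested Python lists; tuple() over small lists is cheap
    for k in (tuple(row) for row in keys_array[indexes].tolist()):
        pass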


I don't have a beautiful answer, and this kind of optimization tends to make the code harder to read. But if you just need more speed, there are two things you can do.

First, we can hoist recomputation out of the loop. You say that all keys in each dictionary have the same length, so you can compute that length once rather than repeatedly inside the loop. This cuts about 20% for me:

    def access_variable_length():
        try:
            length = len(next(iter(entry_keys)))
        except StopIteration:
            return
        r = list(range(length))
        for key in entry_keys:
            for k in (x for x in all_entries.keys()
                      if all(key[i] is None or key[i] == x[i] for i in r)):
                pass

Not pretty, I agree. But we can make it much faster (and even uglier!) by constructing a fixed-length function with eval. Like this:

    def access_variable_length_new():
        try:
            length = len(next(iter(entry_keys)))
        except StopIteration:
            return
        func_l = ["(key[{0}] is None or x[{0}] == key[{0}])".format(i)
                  for i in range(length)]
        func_s = "lambda x, key: " + " and ".join(func_l)
        func = eval(func_s)
        for key in entry_keys:
            for k in (x for x in all_entries.keys() if func(x, key)):
                pass

For me it's almost as fast as the static version.
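If the same arities recur across calls, the eval-built predicate can be cached so it is constructed only once per length; a minimal sketch, with a helper name of my own choosing:

    from functools import lru_cache


    @lru_cache(maxsize=None)
    def make_matcher(length):
        # build "(key[0] is None or x[0] == key[0]) and ..." once per arity
        clauses = " and ".join(
            "(key[{0}] is None or x[{0}] == key[{0}])".format(i)
            for i in range(length))
        return eval("lambda x, key: " + clauses)

    # usage: func = make_matcher(len(key)); func(x, key)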


Say you have a dictionary d:

    d = {(1, 2): 3, (1, 4): 5, (2, 4): 2, (1, 3): 4,
         (2, 3): 6, (5, 1): 5, (3, 8): 5, (3, 6): 9}

First, get the dictionary keys:

    keys = d.keys()
    # dict_keys([(1, 2), (3, 8), (1, 3), (2, 3), (3, 6), (5, 1), (2, 4), (1, 4)])

Now let's define an is_match function that decides, for two given tuples, whether they match under your conditions:
is_match((1, 7), (1, None)), is_match((1, 5), (None, 5)) and is_match((1, 4), (1, 4)) return True, while is_match((1, 7), (1, 8)) and is_match((4, 7), (6, 12)) return False.

    def if_equal(a, b):
        # None on either side acts as a wildcard for that position
        if a is None or b is None:
            return True
        return a == b


    is_match = lambda a, b: False not in list(map(if_equal, a, b))

    tup = (1, None)
    matched_keys = [key for key in keys if is_match(key, tup)]
    # [(1, 2), (1, 3), (1, 4)]
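One caveat: map stops at the shorter of the two tuples, so a length-2 filter would also "match" the front of longer keys. If your dictionaries mix arities, a stricter variant (my own sketch, treating only filter entries as wildcards, per the question) adds a length check:

    def is_match_strict(key, flt):
        # require equal arity; None in the filter is still a wildcard
        return len(key) == len(flt) and all(
            f is None or f == k for k, f in zip(key, flt))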
