TL;DR
What is the most efficient way to implement a filter function for dictionaries with variable-length tuple keys? The filter is a tuple of the same length as the dictionary's keys, and the function must return all keys in the dictionary that match the filter, i.e. all keys for which filter[i] is None or filter[i] == key[i] holds for every dimension i.
In my current project, I need to process dictionaries containing a lot of data. Their general structure is: tuples of 2-4 integers as keys and integers as values. All keys within a given dictionary have the same length. To illustrate, here are examples of dictionaries I need to process:
    {(1, 2): 1, (1, 5): 2}
    {(1, 5, 3): 2}
    {(5, 2, 5, 2): 8}
These dictionaries contain many entries; the largest have about 20,000. I frequently have to filter them, often looking only at certain indices of the key tuples. Ideally, I want a function to which I can pass a filter tuple; the function should then return all keys that match it. If the filter tuple contains None at some index, it matches any value at that index of a key.
An example of what the function should do for a dictionary with two-dimensional keys:
    >>> d = {(1, 2): 1, (1, 5): 2, (2, 5): 1, (3, 9): 5}
    >>> my_filter_fn(d, (1, None))
    {(1, 2), (1, 5)}
    >>> my_filter_fn(d, (None, 5))
    {(1, 5), (2, 5)}
    >>> my_filter_fn(d, (2, 4))
    set()
    >>> my_filter_fn(d, (None, None))
    {(1, 2), (1, 5), (2, 5), (3, 9)}
Since my dictionaries have tuples of different sizes, I tried to solve this with a generator expression that takes the tuple length into account:
    def my_filter_fn(entries: dict, match: tuple):
        return (x for x in entries.keys()
                if all(match[i] is None or match[i] == x[i]
                       for i in range(len(match))))
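For clarity, here is a self-contained usage sketch of this function on the example dictionary from above (the definition is repeated so the snippet runs on its own; the generator yields matching keys, so I wrap it in set()):

```python
# Self-contained usage sketch of the generator-based filter.
def my_filter_fn(entries: dict, match: tuple):
    return (x for x in entries.keys()
            if all(match[i] is None or match[i] == x[i]
                   for i in range(len(match))))

d = {(1, 2): 1, (1, 5): 2, (2, 5): 1, (3, 9): 5}
print(set(my_filter_fn(d, (1, None))))   # {(1, 2), (1, 5)}
print(set(my_filter_fn(d, (None, 5))))   # {(1, 5), (2, 5)}
```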
Unfortunately, this is rather slow compared to writing the condition out by hand ( (match[0] is None or match[0] == x[0]) and (match[1] is None or match[1] == x[1]) ); for 4 dimensions it is about 10 times slower. This is a problem for me, since I need to do this filtering quite often.
The following code demonstrates the performance issue. It is provided only to illustrate the problem and to make the tests reproducible; you can skip it, the results are below.
    import random
    import timeit


    def access_variable_length():
        for key in entry_keys:
            for k in (x for x in all_entries.keys()
                      if all(key[i] is None or key[i] == x[i]
                             for i in range(len(key)))):
                pass


    def access_static_length():
        for key in entry_keys:
            for k in (x for x in all_entries.keys()
                      if (key[0] is None or x[0] == key[0])
                      and (key[1] is None or x[1] == key[1])
                      and (key[2] is None or x[2] == key[2])
                      and (key[3] is None or x[3] == key[3])):
                pass


    def get_rand_or_none(start, stop):
        number = random.randint(start - 1, stop)
        if number == start - 1:
            number = None
        return number


    entry_keys = set()
    for h in range(100):
        entry_keys.add((get_rand_or_none(1, 200),
                        get_rand_or_none(1, 10),
                        get_rand_or_none(1, 4),
                        get_rand_or_none(1, 7)))

    all_entries = dict()
    for l in range(13000):
        all_entries[(random.randint(1, 200),
                     random.randint(1, 10),
                     random.randint(1, 4),
                     random.randint(1, 7))] = 1

    variable_time = timeit.timeit("access_variable_length()",
                                  "from __main__ import access_variable_length",
                                  number=10)
    static_time = timeit.timeit("access_static_length()",
                                "from __main__ import access_static_length",
                                number=10)

    print("variable length time: {}".format(variable_time))
    print("static length time: {}".format(static_time))
Results:
variable length time: 9.625867042849316
static length time: 1.043319165662158
I would like to avoid writing three separate functions my_filter_fn2 , my_filter_fn3 and my_filter_fn4 to cover all possible key sizes and then dispatching to static-dimension filtering. I know that variable-size filtering will always be slower than fixed-size filtering, but I had hoped it would not be almost 10 times slower. Since I am not a Python expert, I was hoping there is a smart way to reformulate the variable-size generator expression to get better performance.
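For context, one candidate reformulation I have sketched (my own assumption, not benchmarked in the tests above) pairs filter and key elements with zip instead of computing range(len(...)) and two subscript operations per dimension; the semantics are intended to be identical to my_filter_fn:

```python
# Untested sketch: zip-based variant that avoids per-dimension indexing.
# Intended to behave exactly like my_filter_fn above.
def my_filter_fn_zip(entries: dict, match: tuple):
    return (key for key in entries
            if all(m is None or m == k for m, k in zip(match, key)))

d = {(1, 2): 1, (1, 5): 2, (2, 5): 1, (3, 9): 5}
print(set(my_filter_fn_zip(d, (None, 5))))   # {(1, 5), (2, 5)}
```

I do not know whether this closes the gap to the hand-written condition, which is part of what I am asking.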
What is the most efficient way to filter a huge dictionary, as I described?