Python: summing the values of three-layer dictionaries

Given dictionaries with three layers of keys, what's the fastest way to sum their values? Here is my current approach:

    from collections import defaultdict

    dicts = [
        {'a': {'b': {'c': 1}}},
        {'a': {'b': {'c': 4, 'e': 3}}}
    ]

    def sum_three_deep_dict_values(dicts):
        '''Read in two dicts and return a dictionary that contains
        their outer-joined keys and value sums'''
        combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        for d in dicts:
            for w1, val_dict in d.iteritems():
                for w2 in val_dict.iterkeys():
                    for w3 in val_dict[w2].iterkeys():
                        combined[w1][w2][w3] += d[w1][w2][w3]
        return combined

    print sum_three_deep_dict_values(dicts)

The expected result here is {'a': {'b': {'c': 5, 'e': 3}}}. The goal is to sum the values where both dictionaries have the same keys (for example, d['a']['b']['c'] here) and include the remaining key-value pairs from either dictionary in the output.

There are a number of questions on SO that appear to answer the question "How do I sum the values of nested dictionaries?" However, after reading through them last night, every one I found involved some odd special case or extra parameter, for example "merge/ignore the nth layer of keys" or "apply an if condition at a particular level". So I wanted to ask the simple question directly: what's the best way to sum the values of two three-layer dictionaries in Python?

+6
2 answers

I think your current approach is generally good. My suggestion would be to eliminate as many dictionary lookups as possible. Iterating over keys and values together should be as fast as iterating over keys alone, so you can combine them. The final d[w1][w2][w3] lookup then isn't needed, and neither is the intermediate val_dict[w2] lookup. So something like this:

    def sum_three_deep_dict_values(dicts):
        '''Read in two dicts and return a dictionary that contains
        their outer-joined keys and value sums'''
        combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        for layer0 in dicts:
            for k1, layer1 in layer0.iteritems():
                for k2, layer2 in layer1.iteritems():
                    for k3, count in layer2.iteritems():
                        combined[k1][k2][k3] += count
        return combined

I've taken the liberty of changing the naming scheme slightly.
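(If you're on Python 3, iteritems() and iterkeys() are gone; a sketch of the same approach using items(), with everything else unchanged:)

```python
from collections import defaultdict

def sum_three_deep_dict_values(dicts):
    '''Same outer-join-and-sum, with items() replacing
    the Python-2-only iteritems().'''
    combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for layer0 in dicts:
        for k1, layer1 in layer0.items():
            for k2, layer2 in layer1.items():
                for k3, count in layer2.items():
                    combined[k1][k2][k3] += count
    return combined

dicts = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'c': 4, 'e': 3}}}]
result = sum_three_deep_dict_values(dicts)
print(result['a']['b']['c'], result['a']['b']['e'])  # 5 3
```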

If you're still worried about speed after testing the above, you may want to look into other data structures or third-party libraries. But before you do that, try PyPy - I find that it often gives at least a 4x speedup on vanilla for loops.

Also, benchmark this against your original code. I think the reasoning above is sound, but it's still somewhat hypothetical. I'm curious what others suggest, too. At the scale you're working at, this could really matter! (Out of curiosity, how long does your current code take?)

UPDATE: I tested this and it is indeed faster, though only by a hair:

    >>> %timeit sum_three_deep_original(dicts)
    1000 loops, best of 3: 1.38 ms per loop
    >>> %timeit sum_three_deep_edited(dicts)
    1000 loops, best of 3: 1.26 ms per loop

I assume you need more speed than that for your application. I tried it with PyPy, and I also compiled it with cython (but without any changes or type annotations). PyPy wins by about 66%. Regular python again (with a slightly different test setup):

    :~ $ python -c 'from tdsum import test; test()'
    1.63905096054

Compiled with cython:

    :~ $ python -c 'from tdsum import test; test()'
    1.224848032

And using PyPy:

    :~ $ pypy -c 'from tdsum import test; test()'
    0.427165031433
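The tdsum module itself isn't shown in the answer; purely as an illustration, a harness like the one timed above might look something like this (the module name comes from the answer, but the function body, workload size, and repetition count are assumptions; written with Python 3's items() so it runs on current interpreters):

```python
import time
from collections import defaultdict

def sum_three_deep_dict_values(dicts):
    combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for layer0 in dicts:
        for k1, layer1 in layer0.items():
            for k2, layer2 in layer1.items():
                for k3, count in layer2.items():
                    combined[k1][k2][k3] += count
    return combined

def test(reps=100):
    # Synthetic workload: two dicts with fully overlapping keys,
    # so every summed leaf value is 2.
    dicts = [
        {str(i): {str(j): {str(k): 1 for k in range(5)}
                  for j in range(5)}
         for i in range(5)}
        for _ in range(2)
    ]
    start = time.time()
    for _ in range(reps):
        result = sum_three_deep_dict_values(dicts)
    print(time.time() - start)  # elapsed seconds for all repetitions
    return result

result = test(10)
```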

I would expect a real cython version using a custom-built data structure to significantly outperform PyPy. The problem is that you can't use a dict and still get the iteration speedup you'd want, because cython has to go through the Python object layer. So you'd have to implement your own hash table!

I've often wondered why cython doesn't provide a solution to this problem; perhaps there's a numpy type that could be used. I'll keep looking!

+3

Here's a solution that uses a flatten function and a corresponding puffup function, and works for arbitrarily deeply nested dicts. It works for your case, but I haven't tested it much beyond that:

    from collections import Counter

    def flatten(d, parent=None):
        for k, v in d.items():
            keys = (k,) if parent is None else parent + (k,)
            if isinstance(v, dict):
                yield from flatten(v, keys)
            else:
                yield keys, v

    def puffup(c):
        top = {}
        for k, v in c.items():
            current = top  # reset walk
            for ki in k[:-1]:
                if ki not in current:
                    current[ki] = {}
                current = current[ki]  # descend one level
            current[k[-1]] = v
        return top

    dicts = [
        {'a': {'b': {'c': 1}}},
        {'a': {'b': {'c': 4, 'e': 3}}}
    ]

    c = Counter()
    for d in dicts:
        c += dict(flatten(d))

    print(puffup(c))
    # {'a': {'b': {'c': 5, 'e': 3}}}

I just saw that you're also looking for speed. While this approach is much more flexible, it's about 2.5x slower than the previous answer, at least without scaling the inputs up significantly.
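To illustrate the flexibility: the same flatten/puffup pair handles mixed nesting depths with no code changes. A self-contained check (the two helpers are repeated here, with the puffup descent written via setdefault, and the four-level example data is mine):

```python
from collections import Counter

def flatten(d, parent=None):
    # Yield (key_path_tuple, leaf_value) pairs at any nesting depth.
    for k, v in d.items():
        keys = (k,) if parent is None else parent + (k,)
        if isinstance(v, dict):
            yield from flatten(v, keys)
        else:
            yield keys, v

def puffup(c):
    # Rebuild a nested dict from {key_path_tuple: value}.
    top = {}
    for k, v in c.items():
        current = top
        for ki in k[:-1]:
            current = current.setdefault(ki, {})
        current[k[-1]] = v
    return top

# Four levels deep on one branch, two on another.
dicts = [
    {'a': {'b': {'c': {'d': 1}}}, 'x': {'y': 2}},
    {'a': {'b': {'c': {'d': 4}}}, 'x': {'y': 5}},
]
c = Counter()
for d in dicts:
    c += dict(flatten(d))  # Counter's in-place += sums per-path counts

print(puffup(c))  # {'a': {'b': {'c': {'d': 5}}}, 'x': {'y': 7}}
```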

0

Source: https://habr.com/ru/post/987502/

