I think your current approach is generally good. My suggestion would be to eliminate as many dictionary lookups as possible. Iterating over keys and values together should be about as fast as iterating over keys alone, so you can combine the two. That also makes the final d[w1][w2][w3] lookup unnecessary, along with the intermediate key-membership check. So something like this:
    from collections import defaultdict

    def sum_three_deep_dict_values(dicts):
        '''Read in two dicts and return a dictionary that contains their
        outer-joined keys and value sums'''
        combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        for layer0 in dicts:
            for k1, layer1 in layer0.iteritems():
                for k2, layer2 in layer1.iteritems():
                    for k3, count in layer2.iteritems():
                        combined[k1][k2][k3] += count
        return combined
I've taken the liberty of changing the naming scheme a little.
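For example, with two small made-up inputs, it could be used like this:

    >>> a = {'x': {'y': {'z': 1, 'w': 2}}}
    >>> b = {'x': {'y': {'z': 5}}, 'q': {'r': {'s': 3}}}
    >>> combined = sum_three_deep_dict_values([a, b])
    >>> combined['x']['y']['z']
    6
    >>> combined['q']['r']['s']
    3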
If you're still worried about speed after testing the above, you may need to look into other data structures or third-party libraries. But before you do, try PyPy - I find it often gives at least a 4x speedup on vanilla for loops.
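One alternative data structure worth benchmarking (my own suggestion, not part of the approach above) is to flatten the three levels into tuple keys and accumulate into a collections.Counter. The loops are the same, but you skip building the nested defaultdicts, and lookups afterwards become combined[k1, k2, k3]:

    from collections import Counter

    def sum_three_deep_flat(dicts):
        '''Merge nested count dicts into a flat Counter keyed by (k1, k2, k3).'''
        combined = Counter()
        for layer0 in dicts:
            for k1, layer1 in layer0.iteritems():   # .items() on Python 3
                for k2, layer2 in layer1.iteritems():
                    for k3, count in layer2.iteritems():
                        combined[k1, k2, k3] += count
        return combined

Whether that is actually faster will depend on your data, so time it against the version above.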
Also, test my suggested version against your original code. I think the reasoning above is sound, but it's still somewhat hypothetical; I'm curious to see what others suggest. At the scale you're working at, this could be a real problem! (Out of curiosity, how long does your current code take?)
UPDATE: I tested this, and it is indeed faster, though only by a hair:
    >>> %timeit sum_three_deep_original(dicts)
    1000 loops, best of 3: 1.38 ms per loop
    >>> %timeit sum_three_deep_edited(dicts)
    1000 loops, best of 3: 1.26 ms per loop
I assume you'll want more speed than that for your application. I tried it with PyPy, and I also compiled it with Cython (but without making any code changes or adding type annotations). PyPy wins, by about 66%. Plain Python again (timed with a slightly different setup, hence the different numbers):
    :~ $ python -c 'from tdsum import test; test()'
    1.63905096054
Compiled with Cython:
    :~ $ python -c 'from tdsum import test; test()'
    1.224848032
And using PyPy:
    :~ $ pypy -c 'from tdsum import test; test()'
    0.427165031433
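For context, the tdsum module itself isn't shown here; a minimal harness along these lines (the data sizes and generation are my own guesses) would produce output in the same format:

    # tdsum.py -- sketch; assumes sum_three_deep_dict_values (defined above)
    # lives in, or is imported into, this module
    import time

    def make_dict(n):
        '''Build an n x n x n nested dict of 1-counts to merge.'''
        d = {}
        for i in range(n):
            d[i] = {}
            for j in range(n):
                d[i][j] = dict((k, 1) for k in range(n))
        return d

    def test():
        dicts = [make_dict(30), make_dict(30)]
        start = time.time()
        for _ in range(100):
            sum_three_deep_dict_values(dicts)
        print(time.time() - start)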
I would expect a real Cython version using a custom-built data structure to significantly outperform PyPy. The problem is that you can't use a plain dict and still get the fast iteration you'd like, because Cython still has to go through the Python object layer for dict access. So you'd have to implement your own hash table!
I've often wondered why Cython doesn't provide a ready-made solution to this problem; perhaps there's a numpy type that could be used. I'll keep looking!
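To make the numpy idea concrete - this is purely a sketch of mine, and it only pays off if each layer's keys can be mapped to small integer ranges ahead of time:

    import numpy as np

    def pack_counts(d, idx1, idx2, idx3):
        '''Pack a three-deep count dict into a dense 3-D array, given
        dicts mapping each layer's keys to integer indices.'''
        arr = np.zeros((len(idx1), len(idx2), len(idx3)), dtype=np.int64)
        for k1, layer1 in d.iteritems():          # .items() on Python 3
            for k2, layer2 in layer1.iteritems():
                for k3, count in layer2.iteritems():
                    arr[idx1[k1], idx2[k2], idx3[k3]] = count
        return arr

    # Merging then becomes a single vectorized addition:
    # combined = pack_counts(d1, i1, i2, i3) + pack_counts(d2, i1, i2, i3)

The packing step is still a pure-Python loop, so the win only shows up if the same key mappings are reused across many merges, or if the data can stay in array form throughout.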