Efficiently tracking the top-k keys of a dictionary by value

How do you efficiently track the top-k keys of a dictionary with the highest values, while the dictionary's values are being updated?

I tried the naive approach of creating a sorted list from the dictionary after each update (as described in "Getting key with maximum value in dictionary?"), but that is very expensive and does not scale.

Real world example:

Counting the frequency of words coming from an endless stream of data. At any moment the program may be asked to report whether a word is among the current top-k most frequent words. How can we implement this efficiently?

collections.Counter is too slow

    >>> from itertools import permutations
    >>> from collections import Counter
    >>> from timeit import timeit
    >>> c = Counter()
    >>> for x in permutations(xrange(10), 10):
    ...     c[x] += 1
    ...
    >>> timeit('c.most_common(1)', 'from __main__ import c', number=1)
    0.7442058258093311
    >>> sum(c.values())
    3628800

It takes almost a second to calculate this value!

I am looking for O(1) time for the most_common() operation. This should be feasible with a separate data structure that internally stores only the current top-k items and keeps track of the current minimum value among them.

+4
3 answers

We can implement a class that keeps track of the top-k values, since I don't believe the standard library has this built in. It is kept updated in parallel with the main dictionary object (probably a Counter). You could also use it as an attribute of a subclass of the main dictionary object (a sketch of that pairing appears at the end of this answer).

Implementation

    from collections import defaultdict

    class MostCommon(object):
        """Keep track of the top-k key-value pairs.

        Attributes:
            top: Integer representing the top-k items to keep track of.
            store: Dictionary of the top-k items.
            min: The current minimum of any top-k item.
            min_set: Dictionary where keys are counts, and values are the
                sets of keys with that count.
        """
        def __init__(self, top):
            """Create a new MostCommon object to track key-value pairs.

            Args:
                top: Integer representing the top-k values to keep track of.
            """
            self.top = top
            self.store = dict()
            self.min = None
            self.min_set = defaultdict(set)

        def _update_existing(self, key, value):
            """Update an item that is already one of the top-k values."""
            # Currently handles only values that are non-decreasing.
            assert value > self.store[key]
            self.min_set[self.store[key]].remove(key)
            if self.store[key] == self.min:  # Previously was the minimum.
                if not self.min_set[self.store[key]]:  # No more minimums.
                    del self.min_set[self.store[key]]
                    self.min_set[value].add(key)
                    self.min = min(self.min_set.keys())
            self.min_set[value].add(key)
            self.store[key] = value

        def __contains__(self, key):
            """True if the key is one of the top-k items."""
            return key in self.store

        def __setitem__(self, key, value):
            """Assign a value to a key.

            The item won't be stored if it is less than the minimum (and
            the store is already full). If the item is already in the
            store, the value will be updated along with `min` if necessary.
            """
            if len(self.store) < self.top:  # Store it if we aren't full yet.
                if key in self.store:  # We already have this item.
                    self._update_existing(key, value)
                else:  # Brand new item.
                    self.store[key] = value
                    self.min_set[value].add(key)
                    # Check for None first: comparing a number against None
                    # is an error in Python 3 and fragile in Python 2.
                    if self.min is None or value < self.min:
                        self.min = value
            else:  # We're full; the value must exceed the minimum to be added.
                if value > self.min:
                    if key in self.store:  # We already have this item.
                        self._update_existing(key, value)
                    else:  # Brand new item.
                        # Make room by removing one of the current minimums.
                        old = self.min_set[self.min].pop()
                        del self.store[old]
                        # Delete the set if there are no old minimums left.
                        if not self.min_set[self.min]:
                            del self.min_set[self.min]
                        # Add the new item.
                        self.min_set[value].add(key)
                        self.store[key] = value
                        self.min = min(self.min_set.keys())

        def __repr__(self):
            if len(self.store) < 10:
                store = repr(self.store)
            else:
                length = len(self.store)
                largest = max(self.store.itervalues())
                store = '<len={length}, max={largest}>'.format(
                    length=length, largest=largest)
            return ('{self.__class__.__name__}(top={self.top}, '
                    'min={self.min}, store={store})'.format(
                        self=self, store=store))

Usage example

    >>> common = MostCommon(2)
    >>> common
    MostCommon(top=2, min=None, store={})
    >>> common['a'] = 1
    >>> common
    MostCommon(top=2, min=1, store={'a': 1})
    >>> 'a' in common
    True
    >>> common['b'] = 2
    >>> common
    MostCommon(top=2, min=1, store={'a': 1, 'b': 2})
    >>> common['c'] = 3
    >>> common
    MostCommon(top=2, min=2, store={'c': 3, 'b': 2})
    >>> 'a' in common
    False
    >>> common['b'] = 4
    >>> common
    MostCommon(top=2, min=3, store={'c': 3, 'b': 4})

Access after updating values really is O(1):

    >>> counter = Counter()
    >>> for x in permutations(xrange(10), 10):
    ...     counter[x] += 1
    ...
    >>> common = MostCommon(1)
    >>> for key, value in counter.iteritems():
    ...     common[key] = value
    ...
    >>> common
    MostCommon(top=1, min=1, store={(9, 7, 8, 0, 2, 6, 5, 4, 3, 1): 1})
    >>> timeit('repr(common)', 'from __main__ import common', number=1)
    1.3251570635475218e-05

Access is O(1), but when the minimum changes during a __setitem__ call, that operation is O(k), where k is the number of top values being tracked. This is still better than Counter, which is O(n) on every access, where n is the size of the entire dictionary!
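As mentioned at the top of this answer, you could also pair this with a Counter subclass. Here is a minimal sketch of that pairing; TopCounter is a hypothetical name, and it assumes counts only ever increase (which _update_existing requires):

    from collections import Counter

    class TopCounter(Counter):
        """A Counter that keeps a MostCommon attribute in sync (sketch)."""
        def __init__(self, top, *args, **kwargs):
            # The tracker must exist before Counter.__init__ runs, because
            # initializing from an iterable goes through __setitem__.
            self.common = MostCommon(top)
            super(TopCounter, self).__init__(*args, **kwargs)

        def __setitem__(self, key, value):
            super(TopCounter, self).__setitem__(key, value)
            self.common[key] = value  # Mirror every update into the tracker.

Usage looks like this:

    >>> tc = TopCounter(2)
    >>> tc['a'] += 1
    >>> 'a' in tc.common
    True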

0

collections.Counter.most_common makes a pass over all the values, finding the N-th largest by putting them in a heap as it goes (in, I think, O(M log N) time, where M is the total number of dictionary entries).
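For reference, a sketch of that idea (CPython's actual implementation is essentially a heapq.nlargest call over all the items; this is a paraphrase, not the library source):

    import heapq
    from operator import itemgetter

    def most_common(counter, n):
        # One pass over all M (key, count) pairs, keeping a heap of the
        # N largest counts as it goes: O(M log N) overall.
        return heapq.nlargest(n, counter.items(), key=itemgetter(1))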

heapq, as suggested by Wei Yen in the comments, might work fine: in parallel with the dictionary, maintain a heapq of the N largest values, and whenever you modify the dict, check whether the value is in the heap or should now be. The problem is, as you noted, that the interface doesn't really have a way to change the "priority" (in your case, the [negated, since it's a min-heap] count) of an existing element.

You could change the relevant element in place and then run heapq.heapify to restore heapness. That takes a linear pass over the heap (of size N) to find the relevant element (unless you do extra bookkeeping to associate elements with positions, which is probably not worth it), and another linear pass to re-heapify. In the case where an element wasn't in the list but now should be, you'd need to add it to the heap by replacing the smallest element (in linear time, barring some extra structure).
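A minimal sketch of that variant, assuming the heap stores mutable [negated_count, word] lists; the bump helper is a made-up name:

    import heapq

    def bump(heap, word, new_count):
        """Raise the count of `word` in a heap of [negated_count, word] lists."""
        for entry in heap:       # O(N) scan to find the relevant element.
            if entry[1] == word:
                entry[0] = -new_count
                break
        heapq.heapify(heap)      # O(N) pass to restore the heap invariant.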

The heapq private interface has a _siftdown function that contains this comment:

    # 'heap' is a heap at all indices >= startpos, except possibly for pos.  pos
    # is the index of a leaf with a possibly out-of-order value.  Restore the
    # heap invariant.

That sounds good! Calling heapq._siftdown(heap, 0, pos_of_relevant_idx) will re-heapify in O(log N) time. Of course, you have to find the position of the element you're incrementing in the first place, which takes linear time. You could maintain a dictionary from elements to indices to avoid that (while also keeping a pointer to the position of the smallest element), but then you'd either have to copy the _siftdown source and change it to update the dictionary when it swaps elements, or do a linear-time pass to rebuild the dictionary afterwards (but you were trying to avoid linear passes...).

Done carefully, this approach should work out to O(log N) time. It turns out, however, that there is something called a Fibonacci heap, which supports all the operations you need in (amortized) constant time. Unfortunately, this is one of those cases where big-O isn't the whole story; the complexity of Fibonacci heaps means that in practice, except perhaps for very large heaps, they aren't actually faster than binary heaps. Also (perhaps "therefore"), I couldn't find a standard Python implementation in a quick search, though the Boost C++ libraries include one.

So I would first try using heapq, doing a linear search for the element you're changing and calling _siftdown; that is O(N) time, compared to O(M log N) for the Counter approach. If that turns out to be too slow, you could maintain the extra index dictionary and write your own version of _siftdown that updates the dict, which should come out to O(log N). If that is still too slow (which I doubt), you could look for a Python wrapper for Boost's Fibonacci heap (or another implementation), but I really doubt it would be worth the hassle.
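A minimal sketch of that first plan, assuming counts only ever grow; TopK and increment are made-up names here, and heapq._siftdown is a private CPython detail that could change between versions:

    import heapq

    class TopK(object):
        """Track up to n entries as a heap of [negated_count, word] lists."""
        def __init__(self, n):
            self.n = n
            self.heap = []

        def increment(self, word, count):
            for pos, entry in enumerate(self.heap):  # O(N) linear search.
                if entry[1] == word:
                    entry[0] = -count  # Count grew, so the negated key shrank.
                    # A smaller key can only need to move toward the root:
                    # restore the invariant in O(log N).
                    heapq._siftdown(self.heap, 0, pos)
                    return
            if len(self.heap) < self.n:
                heapq.heappush(self.heap, [-count, word])
            # Evicting the smallest tracked entry when full is omitted; as
            # noted above, that needs extra structure or a linear pass.

        def __contains__(self, word):
            return any(entry[1] == word for entry in self.heap)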

+2

Use collections.Counter; it already handles the real-world example you describe. Do you have other use cases?

+1

Source: https://habr.com/ru/post/1469424/

