Ideal data structure with fast search, fast update and easy comparison / sorting

I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it as follows:

  • Data arrives, check if it is already present in the data structure (hash equality, not timestamp).
  • If so, update the timestamp to "now."
  • If not, add it to the timestamped set "now"

Periodically, I want to delete and return a list of tuples that exceeds a specific timestamp (I need to update various other elements when they "expire"). The timestamp should not be anything specific (it can be a unix timestamp, a python datetime object, or some other simple hash / string comparison).

I use this to receive incoming data, update it if it is already present, and delete data older than X seconds / minutes.

Several data structures can also be a reasonable proposal (I initially went with the priority queue +, but the priority queue was less optimal for constant updating of values).

Other approaches to achieve the same are welcome. The ultimate goal is to keep track of when the elements are a) new to the system, b) already exist in the system, and c) when they expire.

+6
source share
5 answers

This is a pretty good space. What you need are two structures. You need to tell you something that your key ( hash in your case) is known to the collection. For this, the dict very suitable; we just match hash to timestamp so you can easily find each item. Iterating over elements in order of time is a task especially suitable for Heaps, which are provided by the heapq module. Each time we see a key, we simply add it to our heap, like a tuple (timestamp, hash) .

Unfortunately, there is no way to look at a list with inheritance and delete certain items (because, say, they were updated before the expiration date). We will get around this by simply ignoring the heap entries that have timestamps that differ from the value in the dict.

So, here is a place to start, perhaps you can add methods to the wrapper class to support additional operations or change the way data is stored:

 import heapq class ExpiringCache(object): def __init__(self): self._dict = {} self._heap = [] def add(self, key, expiry): self._dict[key] = expiry heapq.heappush(self._heap, (expiry, key)) def contains(self, key): return key in self._dict def collect(self, maxage): while self._heap and self._heap[0][0] <= maxage: expiry, key = heapq.heappop(self._heap) if self._dict.get(key) == expiry: del self._dict[key] def items(self): return self._dict.items() 

create cache and add some elements

 >>> xc = ExpiringCache() >>> xc.add('apples', 1) >>> xc.add('bananas', 2) >>> xc.add('mangoes', 3) 

re-add an item with an even later expiration

 >>> xc.add('apples', 4) 

collect all "older" than two time blocks

 >>> xc.collect(2) >>> xc.contains('apples') True >>> xc.contains('bananas') False 
+4
source

The closest thing I can think of for one structure with the properties you want is a play tree (with your hash as the key).

By turning recently processed (and therefore updated) nodes to the root, you should get the least recently processed (and, therefore, updated) data on the sheets or grouped in the right subtree.

Finding out the details (and their implementation) remains as an exercise for the reader ...


Cautions:

  • the worst height - and therefore complexity - is linear. This should not happen with a decent hash
  • any read-only operations (i.e., searches that do not update the timestamp) break the relationship between the layout of the layout tree and the timestamp

A simpler approach is to store an object containing (hash, timestamp, prev, next) in a regular dict, using prev and next to maintain an up-to-date doubly linked list. Then all you need next to the dict is the head and tail links.

Insert and update are still constant time (hash search + linked list), and walking backward from the tail of the list collecting the oldest hashes is linear.

+3
source

If I misunderstand your question, a simple old dict should be ideal for everything except cleansing. Assuming you are trying to avoid checking the entire dictionary during cleanup, I would suggest storing a second data structure for storing pairs (timestamp, hash) .

This additional data structure can be a plain old list or deque (from the collections module). Perhaps the bisect module may be convenient so that the number of comparisons of timestamps is minimal (as opposed to comparing all timestamps until you reach the cutoff value), but since you still have to iterate through the items you need to clear, smoothing the exact details of what would be the fastest require some testing.

Edit:

For Python 2.7 or 3.1+, you can also use OrderedDict (from the collections module). This is basically a dict with an additional order-saving structure built into the class, so you don't need to implement it yourself. The only problem is that the only order it saves is the insertion order, so for your purpose, instead of just reassigning the existing record to a new timestamp, you need to delete it (using del ) and then assign new record with a new timestamp. However, it saves the O (1) search and eliminates the need to maintain a list of pairs (timestamp, hash) ; when the time comes for cleaning, you can simply scroll through the line through the OrderedDict , deleting the entries until you reach one with a timestamp that is later than your clipping.

+2
source

If you handle random false positives, I think that a flowering filter can satisfy your needs (very fast)

http://en.wikipedia.org/wiki/Bloom_filter

and python implementation: https://github.com/axiak/pybloomfiltermmap

EDIT: After reading the message again, I think it will work, but instead of storing the hashes, just let the flowering filter create the hashes for you. i.e., I think you just want to use bloomfilter as a set of timestamps. I assume that your timestamps may just be a set as you haveh them.

+1
source

A simple hash table or dictionary will be O (1) for check / update / set operations. You can simultaneously store data in a simple time-ordered list, where for cleanup operations. Hold the head and tail pointer so that the insert is also O (1), and deleting was as simple as moving the head to reach the target time and deleting all the entries that you find from the hash.

Overhead is one additional pointer to each data item, and the code is simple:

 insert(key,time,data): existing = MyDictionary.find(key) if existing: existing.mark() node = MyNodeType(data,time) #simple container holding args + 'next' pointer node.next = NULL MyDictionary.insert(key,node) Tail.next = node if Head is NULL: Head = node clean(olderThan): while Head.time < olderThan: n = Head.next if not Head.isMarked(): MyDictionary.remove(Head.key) #else it was already overwritten if Head == Tail: Tail = n Head = n 
0
source

Source: https://habr.com/ru/post/955596/


All Articles