Finding an effective way to store history data

Question

The data is a Python dict representing the state of something that changes slowly over time. Values change frequently, usually one or two items at a time. Keys can also change, but this is rare. After each change, a new snapshot is recorded for later analysis.

The result is a long sequence with increasing timestamps. A very simple example, in which “b” turns on, off, and on again:

(timestamp1, {'a':False, 'b':False, 'c':False}),
(timestamp2, {'a':False, 'b':True, 'c':False}),
(timestamp3, {'a':False, 'b':False, 'c':False}), 
(timestamp4, {'a':False, 'b':True, 'c':False}),

This sequence is very convenient to work with but obviously quite inefficient: almost the same data is copied over and over. A real dict has about 100 items. This is why I am looking for another way to store the data history, both in memory and on disk.

I am sure this has been solved many times in the past. Is there a standard or recommended approach to this problem? The solution does not have to be perfect, just good enough.

This is what I would do if nobody shows me a better approach. Saving only the incremental changes is space-efficient:

(timestamp1, FULL, {'a':False, 'b':False, 'c':False}),
(timestamp2, INCREMENTAL, {'b':True}),
(timestamp3, INCREMENTAL, {'b':False}),
(timestamp4, INCREMENTAL, {'b':True}),

However, accessing the data is not easy, because a state has to be rebuilt in several steps starting from the last FULL state. To limit this drawback, every Nth entry would be saved as FULL and all the rest as INCREMENTAL.
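A minimal sketch of this FULL/INCREMENTAL scheme, under stated assumptions: the names `DeltaHistory`, `full_every` and `_DELETED` are illustrative and not from the original post, and the sentinel is one possible way to record the rare case of a key disappearing.

```python
_DELETED = object()  # hypothetical sentinel: marks a key that was removed


class DeltaHistory:
    """Store every Nth state in full and only the changes in between."""

    def __init__(self, full_every=100):
        self.full_every = full_every
        self.entries = []   # list of (timestamp, kind, payload)
        self._last = None   # copy of the most recently recorded state

    def record(self, timestamp, state):
        if len(self.entries) % self.full_every == 0:
            self.entries.append((timestamp, "FULL", dict(state)))
        else:
            # Keys that are new or whose value changed since the last state.
            delta = {k: v for k, v in state.items()
                     if k not in self._last or self._last[k] != v}
            # Keys that disappeared since the last state.
            delta.update((k, _DELETED) for k in self._last if k not in state)
            self.entries.append((timestamp, "INCREMENTAL", delta))
        self._last = dict(state)

    def state_at(self, index):
        """Rebuild the state stored at position `index`."""
        start = index - index % self.full_every  # nearest FULL at or before
        state = dict(self.entries[start][2])
        for _, _, delta in self.entries[start + 1:index + 1]:
            for k, v in delta.items():
                if v is _DELETED:
                    state.pop(k, None)
                else:
                    state[k] = v
        return state
```

Rebuilding any state replays at most `full_every - 1` deltas, so N trades storage against access time. For storage on disk the in-memory sentinel would need a serializable replacement, for example a reserved marker value.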

I would add one slight improvement: a link to an identical state that has already been recorded, to prevent duplication:

(timestamp1, FULL, {'a':False, 'b':False, 'c':False}),
(timestamp2, INCREMENTAL, {'b':True}),
(timestamp3, SAME_AS, timestamp1),
(timestamp4, SAME_AS, timestamp2),
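The SAME_AS idea could be sketched as follows. This is only an illustration: `canonical` and `record_with_links` are hypothetical names, it assumes a flat dict with sortable keys and hashable values, and for brevity it stores every new state as FULL rather than combining the links with incremental entries.

```python
def canonical(state):
    """Hashable, order-independent fingerprint of a flat dict."""
    return tuple(sorted(state.items()))


def record_with_links(snapshots):
    """Replace repeated states with a link to their first occurrence."""
    seen = {}   # fingerprint -> timestamp of the first occurrence
    out = []
    for timestamp, state in snapshots:
        key = canonical(state)
        if key in seen:
            out.append((timestamp, "SAME_AS", seen[key]))
        else:
            seen[key] = timestamp
            out.append((timestamp, "FULL", dict(state)))
    return out
```

The `seen` index grows with the number of distinct states, so this pays off mainly when the system keeps returning to a small set of configurations.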

python storage

VPfB Jul 19 '16 at 18:08

3 answers

jme · Answer 1 · 2016-07-19T18:27:53+0000

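Rather than storing changes, you could store the states in a "sparse" form. Suppose the only keys are a, b and c, and that only a few values are ever True. Then your example: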

(timestamp1, {'a':False, 'b':False, 'c':False}),
(timestamp2, {'a':False, 'b':True, 'c':False}),
(timestamp3, {'a':False, 'b':False, 'c':False}), 
(timestamp4, {'a':False, 'b':True, 'c':False}),

Here a is never on, b is on at timestamps 2 and 4, and c is never on.

This is essentially a sparse matrix whose stored entries are all 1. SciPy provides sparse matrix types.

A minimal pure-Python version could look like this:

class SparseStates(object):

    def __init__(self, columns):
        self.data = {col: set() for col in columns}

    def __getitem__(self, key):
        row, column = key
        return row in self.data[column]

    def turn_on(self, row, column):
        self.data[column].add(row)

Usage:

>>> states = SparseStates(['a', 'b', 'c'])
>>> states.turn_on(2, 'b')
>>> states.turn_on(4, 'b')
>>> states[2, 'a']
False
>>> states[2, 'b']
True
>>> states.data['a']
set()
>>> states.data['b']
{2, 4}
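To connect this with the snapshot sequence from the question, the per-column sets could be built in one pass. This is a sketch only: `sparse_from_snapshots` is a hypothetical helper, rows are numbered from 1 in the order the snapshots appear, and it assumes all values are booleans, as in the example.

```python
def sparse_from_snapshots(snapshots):
    """Map each key to the set of row numbers at which it was True."""
    data = {}
    for row, (_timestamp, state) in enumerate(snapshots, start=1):
        for key, value in state.items():
            rows = data.setdefault(key, set())  # keep keys that are never on
            if value:
                rows.add(row)
    return data
```

The resulting dict has the same shape as `SparseStates.data`, so only the rows where a value is on take up space.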

JL Peyret · Answer 2 · 2016-07-19T19:00:14+0000

PeopleSoft handles this kind of history with effective-dated records, using an EFFDT column.

Each row is stored together with the date from which it takes effect (EFFDT).

For example, with the keys "a" and "b", the stored rows could look like this:

KEY     EFFDT       Active  VALUES
a       2016-07-16  True    False
a       2016-03-20  True    True
a       2016-01-16  True    False

        #note that 2016-11-22 is a future date.  its data will "activate"
        #any time your selection date criteria is >= Nov 22
b       2016-11-22  False   True

b       2016-05-16  True    False
b       2016-01-16  True    True

To get the values of A and B as they stand today, you would query:

select * from storage 
where KEY in ('A','B') 
and EFFDT = 
    /* pick the latest date on or before the limit date (today) */
    (select MAX(EFFDT) 
    from storage sub 
    where sub.key = storage.key
    and   sub.effdt <= '2016-07-19'
    )
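The query can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the column names and types are guesses based on the table above, booleans are stored as 0/1, lowercase keys are used, and dates are ISO strings so that string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE storage (key TEXT, effdt TEXT, active INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO storage VALUES (?, ?, ?, ?)",
    [
        ("a", "2016-07-16", 1, 0),
        ("a", "2016-03-20", 1, 1),
        ("a", "2016-01-16", 1, 0),
        ("b", "2016-11-22", 0, 1),  # future-dated row from the table above
        ("b", "2016-05-16", 1, 0),
        ("b", "2016-01-16", 1, 1),
    ],
)


def as_of(conn, date):
    """Return, for every key, the row in effect on `date`."""
    query = """
        SELECT key, effdt, active, value FROM storage
        WHERE effdt = (SELECT MAX(sub.effdt) FROM storage sub
                       WHERE sub.key = storage.key AND sub.effdt <= ?)
        ORDER BY key
    """
    return conn.execute(query, (date,)).fetchall()
```

Calling `as_of(conn, "2016-07-19")` picks the 2016-07-16 row for "a" and the 2016-05-16 row for "b"; the 2016-11-22 row is ignored until the limit date reaches it.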

EFFDT means "in effect from this date onward". The "Active" column lets a row be switched off by setting it to False.

{
    # key: [(timestamp, active, values), ...]
    "a" : [("2016-07-16", True, (True,)),
           ("2016-03-20", True, (False,)),
           ("2016-01-16", True, (True,)),
          ],
    "b" : [("2016-05-16", True, (False,)),
           ("2016-01-16", True, (True,)),
          ],
}
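An "as of" lookup over a per-key structure like this one could use `bisect`. The sketch below makes a few assumptions: the lists are kept oldest first (the reverse of the listing above) so they can be searched by binary search, `lookup` is a hypothetical helper, and the inactive 2016-11-22 row for "b" is borrowed from the earlier table to show the Active flag.

```python
from bisect import bisect_right

history = {
    # key: list of (effdt, active, values), sorted oldest first
    "a": [("2016-01-16", True, (True,)),
          ("2016-03-20", True, (False,)),
          ("2016-07-16", True, (True,))],
    "b": [("2016-01-16", True, (True,)),
          ("2016-05-16", True, (False,)),
          ("2016-11-22", False, (True,))],
}


def lookup(history, key, date):
    """Return the values in effect for `key` on `date`, or None."""
    rows = history[key]
    dates = [effdt for effdt, _active, _values in rows]
    pos = bisect_right(dates, date)  # number of rows with effdt <= date
    if pos == 0:
        return None                  # nothing in effect yet
    _effdt, active, values = rows[pos - 1]
    return values if active else None
```

ISO-formatted date strings sort in chronological order, which is what makes the plain string comparison valid here.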

For example, to select the state as of 2016-12-24 while skipping inactive rows, the SQL would be:

select * from storage 
where Active = True   /* we don't want Active = False data, so
                         we filter the subquery results */
and EFFDT = 
    /* pick the latest date on or before the limit date */
    (select MAX(EFFDT) 
    from storage sub 
    where sub.key = storage.key
    and   sub.effdt <= '2016-12-24'
    )

Ohumeronen · Answer 3 · 2016-07-19T18:36:35+0000

You could use pickle to store each state of the dictionary in its own file.

If you want to make sure that two identical configurations of your dictionary are stored in the same pickled file, you can additionally calculate the hash value over the string representation of the dictionary.
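One way this could look, as a sketch rather than a definitive implementation: `save_state` is a hypothetical helper, SHA-256 is an arbitrary choice of hash, and the items are sorted before hashing because the plain string representation of a dict depends on insertion order.

```python
import hashlib
import os
import pickle


def save_state(state, directory):
    """Pickle `state` into a file named after its content hash."""
    # Sort the items so that equal dicts always produce the same string.
    text = repr(sorted(state.items()))
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = os.path.join(directory, digest + ".pickle")
    if not os.path.exists(path):  # identical state already stored
        with open(path, "wb") as f:
            pickle.dump(state, f)
    return path
```

The history itself then only needs to hold (timestamp, path) pairs, and repeated configurations share a single file.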
