Selecting the first index after a specific timestamp using pandas TimeSeries

Question

This is a two-part question: an immediate one and a more general one.

I have a pandas TimeSeries, ts. I want to know the first value after a certain time. I can do it like this:

    ts.ix[ts[datetime(2012,1,1,15,0,0):].first_valid_index()]

a) Is there a better, less awkward way to do this?

b) Coming from C, I have a certain phobia when dealing with these somewhat opaque types: possibly mutable but usually not, possibly lazy but not always. So, to be clear, when I do

 ts[datetime(2012,1,1,15,0,0):].first_valid_index()

ts[datetime(2012,1,1,15,0,0):] is a pandas.TimeSeries object, right? And I could possibly mutate it.

Does this mean that whenever I take a slice, a copy of ts is allocated in memory? Does this mean that this innocuous line of code could actually trigger a copy of a gigabyte of TimeSeries just to get an index value?

Or perhaps they magically share memory, and a lazy copy is made if one of the objects is mutated, for instance? But then, how do you know which specific operations trigger a copy? Maybe not slicing, but what about renaming columns? The documentation does not seem to say. Does that bother you? Should it bother me, or should I just learn not to worry and catch problems with a profiler?

python pandas lazy-evaluation

Arthur B. Oct 23 '12 at 22:27

2 answers

I don't know pandas, so here is a general answer:

You can overload almost anything in Python, and they must have done that here. If you define the special method __getitem__ in your class, it is called when you use obj[key] or obj[start:stop] (with just the key as the argument in the first case, and with a special slice object in the second). You can then return whatever you want.

Here is an example showing how __getitem__ works:

    class Foo(object):
        def __getitem__(self, k):
            if isinstance(k, slice):
                return k.start + k.stop  # properties of the slice object
            else:
                return k

This gives you:

    >>> f = Foo()
    >>> f[42]
    42
    >>> f[23:42]
    65

I assume that in your example, the __getitem__ method returns some special object that contains datetime objects plus a reference to the original ts object. This special object can then use this information to obtain the required information later when the first_valid_index method or the like is called. (It should not even modify the original object, as your question suggested.)
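To make this concrete for the expression in the question, here is a minimal sketch of what __getitem__ receives for an open-ended datetime slice. ShowKey is a hypothetical class used only for illustration; it is not part of pandas and says nothing about how pandas handles the key internally.

```python
from datetime import datetime


class ShowKey(object):
    """Hypothetical helper: hands back whatever key __getitem__ receives."""

    def __getitem__(self, key):
        return key


probe = ShowKey()

# Same subscript shape as ts[datetime(2012,1,1,15,0,0):] in the question.
key = probe[datetime(2012, 1, 1, 15, 0, 0):]

print(type(key).__name__)            # the key is a slice object
print(key.start, key.stop, key.step)  # start is the datetime; stop and step are None
```

The slice object is just a container for start, stop and step, and it accepts arbitrary objects such as datetimes. What the class does with them is entirely up to its __getitem__.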

TL;DR: learn not to worry :-)

Addition: I was curious, so I implemented a minimal example of the behavior described above:

    class FilterableList(list):
        def __init__(self, *args):
            list.__init__(self, *args)
            self.filter = FilterProxy(self)

    class FilterProxy(object):
        def __init__(self, parent):
            self.parent = parent

        def __getitem__(self, sl):
            if isinstance(sl, slice):
                return Filter(self.parent, sl)

    class Filter(object):
        def __init__(self, parent, sl):
            self.parent = parent
            self.sl = sl

        def eval(self):
            return [e for e in self.parent if self.sl.start <= e <= self.sl.stop]

    >>> l = FilterableList([4,5,6,7])
    >>> f = l.filter[6:10]
    >>> f.eval()
    [6, 7]
    >>> l.append(8)
    >>> f.eval()
    [6, 7, 8]

Marian Oct 23 '12 at 10:54

Aman · Accepted Answer · 2012-10-23T23:16:18+0000

Some setup:

    In [1]: import numpy as np
    In [2]: import pandas as pd
    In [3]: from datetime import datetime
    In [4]: dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7), datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
    In [5]: ts = pd.Series(np.random.randn(6), index=dates)
    In [6]: ts
    Out[6]:
    2011-01-02   -0.412335
    2011-01-05   -0.809092
    2011-01-07   -0.442320
    2011-01-08   -0.337281
    2011-01-10    0.522765
    2011-01-12    1.559876

OK, now to answer your first question: (a) yes, there are less clunky ways, depending on your intent. It is pretty simple:

    In [9]: ts[datetime(2011, 1, 8):]
    Out[9]:
    2011-01-08   -0.337281
    2011-01-10    0.522765
    2011-01-12    1.559876

This is a slice containing all values from the selected date onward. You can select just the first one with:

    In [10]: ts[datetime(2011, 1, 8):][0]
    Out[10]: -0.33728079849770815
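As an alternative to slicing and then taking element 0, here is a minimal sketch that looks up the position directly with the index's searchsorted method. It assumes the index is sorted in ascending order. The helper name first_at_or_after and the fixed sample values are mine, chosen for illustration; they are not from the answer above.

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
# Fixed values instead of random ones so the result is reproducible.
ts = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=dates)


def first_at_or_after(series, when):
    """Return (label, value) of the first entry at or after `when`, else None.

    Hypothetical helper; assumes `series.index` is sorted ascending.
    """
    # Position of the first index label >= `when` (binary search).
    pos = series.index.searchsorted(pd.Timestamp(when))
    if pos >= len(series):
        return None  # every label is earlier than `when`
    return series.index[pos], series.iloc[pos]


# 2011-01-06 is not in the index; the next available label is 2011-01-07.
label, value = first_at_or_after(ts, datetime(2011, 1, 6))
print(label, value)

# The slice-based form from the answer, written with explicit indexers.
same_value = ts.loc[datetime(2011, 1, 6):].iloc[0]
```

Using .iloc[0] rather than a bare [0] makes it explicit that the lookup is positional, which avoids ambiguity on a Series whose index is not integer-based.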

To your second question, (b): this type of indexing returns a slice (a view) of the original, just as slicing a numpy array does. It is NOT a copy of the original. See this question or many similar ones: Bug or feature: cloning a numpy array w/ slicing

To demonstrate, change the slice:

    In [21]: ts2 = ts[datetime(2011, 1, 8):]
    In [23]: ts2[0] = 99

This modifies the original ts object, since ts2 is a slice, not a copy.

    In [24]: ts
    Out[24]:
    2011-01-02    -0.412335
    2011-01-05    -0.809092
    2011-01-07    -0.442320
    2011-01-08    99.000000
    2011-01-10     0.522765
    2011-01-12     1.559876

If you need a copy, you can (generally) use the copy method or (in this case) use truncate:

    In [25]: ts3 = ts.truncate(before='2011-01-08')
    In [26]: ts3
    Out[26]:
    2011-01-08    99.000000
    2011-01-10     0.522765
    2011-01-12     1.559876

Changing this copy will not change the original.

    In [27]: ts3[1] = 99
    In [28]: ts3
    Out[28]:
    2011-01-08    99.000000
    2011-01-10    99.000000
    2011-01-12     1.559876
    In [29]: ts  # The January 10th value is unchanged.
    Out[29]:
    2011-01-02    -0.412335
    2011-01-05    -0.809092
    2011-01-07    -0.442320
    2011-01-08    99.000000
    2011-01-10     0.522765
    2011-01-12     1.559876
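The view-versus-copy distinction can also be checked without mutating anything. The following is a minimal sketch, assuming a float Series backed by a single NumPy array: np.shares_memory reports whether two arrays overlap in memory. The sample values are mine. As far as I understand, recent pandas versions with Copy-on-Write enabled still let a slice share memory with its parent until one of them is written to, so the write-through behaviour shown above may not occur there; treat that as an assumption to verify against your pandas version.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

sliced = ts.loc[datetime(2011, 1, 8):]  # label-based slice of the original
copied = sliced.copy()                  # explicit, independent copy

# Compare the underlying buffers; nothing is modified here.
slice_shares = np.shares_memory(ts.to_numpy(), sliced.to_numpy())
copy_shares = np.shares_memory(ts.to_numpy(), copied.to_numpy())

print("slice shares memory with ts:", slice_shares)
print("copy shares memory with ts:", copy_shares)
```

A check like this answers the "does this line allocate a gigabyte?" worry directly: if the slice shares memory, taking it only to read an index value does not duplicate the data.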

This example is taken directly from Wes McKinney's Python for Data Analysis. Check it out; it's great.
