Pandas: efficient way to get the first row with an element that is less than a given value

Question


I am wondering if there is an efficient way to do this in pandas: given a data frame, what is the first row whose value is less than a given value? For example, in a sorted addr column, which is the first value less than 4197080? I want it to return only the row with 4197075.

One solution is to filter on 4197080 first and then take the last row, but that looks like an extremely slow O(N) operation (first building a data frame and then taking its last row), whereas a binary search would take O(log N).

df.addr[ df.addr < 4197080].tail(1)

I timed it, and creating df.addr[ df.addr < 4197080] takes more or less the same time as df.addr[ df.addr < 4197080].tail(1), which strongly hints that internally it builds the whole filtered result first.

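import numpy as np
import pandas as pd
from bisect import bisect
# one million random addresses, sorted in place
num = np.random.randint(0, 10**8, 10**6)
num.sort()
df = pd.DataFrame({'addr':num})
df = df.set_index('addr', drop=False)
df = df.sort_index()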

Getting the first smaller value by filtering is slow:

%timeit df.addr[ df.addr < 57830391].tail(1)
100 loops, best of 3: 7.9 ms per loop

Using lt is faster:

%timeit df.lt(57830391)[-1:]
1000 loops, best of 3: 853 µs per loop

But that is still nowhere near bisect:

%timeit bisect(num, 57830391, 0, len(num))
100000 loops, best of 3: 6.53 µs per loop
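
As a side note on what bisect returns: it is a position, not a value, and plain bisect is bisect_right, so a value equal to the target is counted as well. A minimal sketch of the difference, using a small made-up sorted list (the sample values and variable names are assumptions for illustration):

```python
from bisect import bisect_left, bisect_right

# Small sorted sample; the values are made up for illustration.
addrs = [4196656, 4197034, 4197075, 4197080, 4197134]
target = 4197080

# bisect_right (the same function as plain bisect) counts elements <= target.
pos_le = bisect_right(addrs, target)
# bisect_left counts elements strictly < target.
pos_lt = bisect_left(addrs, target)

# The last element strictly below target sits just before pos_lt,
# provided at least one such element exists.
last_below = addrs[pos_lt - 1] if pos_lt > 0 else None
print(pos_le, pos_lt, last_below)
```

For a strict "less than" lookup, bisect_left is probably the variant that matches the question.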

Is there a better way?


python pandas

Georgios Bitzes, Jun 17 '14 at 12:55


Jeff · Accepted Answer · 2014-06-17T13:07:00+0000

This requires pandas 0.14.0 (for nlargest).

Filter first, then take the largest remaining value. This does not rely on the data being sorted.

In [16]: s = df['addr']

In [18]: %timeit s[s<5783091]
100 loops, best of 3: 9.01 ms per loop

In [19]: %timeit s[s<5783091].nlargest(1)
100 loops, best of 3: 11 ms per loop
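
A minimal sketch of the filter-then-nlargest approach on unsorted data, assuming pandas 0.14.0 or later. The sample values and the name threshold are made up for illustration:

```python
import pandas as pd

# Hypothetical unsorted data, for illustration only.
s = pd.Series([50, 10, 40, 30, 20], name="addr")
threshold = 35

# Boolean filter: touches every element, so this step is O(N).
below = s[s < threshold]

# Largest value under the threshold; the original index label is kept.
result = below.nlargest(1)
print(result)
```

If nothing is below the threshold, result is simply an empty Series, so callers may want to check its length before using it.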

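If the data is not sorted yet and you want it sorted, the sort itself dominates the time, as shown below. Note the .copy: sort works inplace.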

In [32]: x = np.random.randint(0, 10**8, 10**6)

In [33]: def f(x):
   ....:     x.copy().sort()
   ....:     

In [35]: %timeit f(x)
10 loops, best of 3: 67.2 ms per loop
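
To illustrate why the copy is there, here is a small sketch of in-place versus copying sorts in numpy (the array contents are made up):

```python
import numpy as np

x = np.array([3, 1, 2])

y = x.copy()
y.sort()          # ndarray.sort sorts y in place and returns None
z = np.sort(x)    # np.sort returns a new sorted array

# x is untouched in both cases; only y and z are sorted.
print(x, y, z)
```

Without the copy, repeated timing runs would be sorting an array that is already sorted, which would likely skew the measurement.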

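If the data is already sorted, use searchsorted. Note that you must use the numpy version (that is, operate on .values; Series is slated to override it in 0.14.1).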

In [41]: %timeit  s.values.searchsorted(5783091)
100000 loops, best of 3: 2.5 µs per loop
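
searchsorted returns an insertion position rather than a row, so one more step is needed to get the row itself. A minimal sketch, assuming the column is already sorted in ascending order; the helper name last_row_below and the sample data are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sorted data, for illustration only.
df = pd.DataFrame({"addr": [4196656, 4197034, 4197075, 4197082, 4197134]})
values = df["addr"].values  # must already be sorted ascending


def last_row_below(frame, sorted_values, threshold):
    """Return the last row whose value is < threshold, or None."""
    # side="left" yields the number of elements strictly below threshold.
    pos = np.searchsorted(sorted_values, threshold, side="left")
    if pos == 0:
        return None  # nothing is below the threshold
    return frame.iloc[pos - 1]


row = last_row_below(df, values, 4197080)
print(row["addr"])
```

The lookup is positional (iloc), so it works regardless of what the index holds, but it silently gives wrong answers if the values are not sorted.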
