Optimize pandas query across multiple columns / multi-index

I have a very large table (currently 55 million rows, maybe more), and I need to select subsets and perform very simple operations on these subsets many, many times. It seems that pandas might be the best way to do this in python, but I ran into optimization problems.

I created a fake dataset that closely matches my real one (although it is 5-10 times smaller). It is still large and takes up a lot of memory. There are four columns that I query on and two that I use for calculations:

 import pandas
 import numpy as np
 import timeit

 n = 10000000
 mdt = pandas.DataFrame()
 mdt['A'] = np.random.choice(range(10000, 45000, 1000), n)
 mdt['B'] = np.random.choice(range(10, 400), n)
 mdt['C'] = np.random.choice(range(1, 150), n)
 mdt['D'] = np.random.choice(range(10000, 45000), n)
 mdt['x'] = np.random.choice(range(400), n)
 mdt['y'] = np.random.choice(range(25), n)

 test_A = 25000
 test_B = 25
 test_C = 40
 test_D = 35000
 eps_A = 5000
 eps_B = 5
 eps_C = 5
 eps_D = 5000

 f1 = lambda: mdt.query('@test_A-@eps_A <= A <= @test_A+@eps_A & ' +
                        '@test_B-@eps_B <= B <= @test_B+@eps_B & ' +
                        '@test_C-@eps_C <= C <= @test_C+@eps_C & ' +
                        '@test_D-@eps_D <= D <= @test_D+@eps_D')

This selects (for my random data set) 1848 rows:

 len(f1())
 Out[289]: 1848

And each query takes about 0.1 seconds:

 timeit.timeit(f1, number=10)/10
 Out[290]: 0.10734589099884033

So, I thought I should be able to do better by sorting and indexing the table, right? And I can take advantage of the fact that everything is int, so I can do slices.

 mdt2 = mdt.set_index(['A', 'B', 'C', 'D']).sortlevel()

 f2 = lambda: mdt2.loc[(slice(test_A-eps_A, test_A+eps_A),
                        slice(test_B-eps_B, test_B+eps_B),
                        slice(test_C-eps_C, test_C+eps_C),
                        slice(test_D-eps_D, test_D+eps_D)), :]

 len(f2())
 Out[299]: 1848

And it takes a lot longer:

 timeit.timeit(f2, number=10)/10
 Out[295]: 7.335134506225586

Am I missing something? It seems like I could do something like numpy.searchsorted, but I can't figure out how to do this on multiple columns. Is pandas the wrong choice?

1 answer

So there are two issues here.

First, a trick that makes the slicing syntax a little nicer:

 In [111]: idx = pd.IndexSlice 

1) Your .query does not have the right operator precedence. The & operator binds more tightly than comparison operators such as <=, so each comparison needs to be wrapped in parentheses.
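As a quick illustration of the parse (a made-up expression with literal bounds, inspected with the standard library ast module; pandas' query parser follows Python's precedence rules), the unparenthesized form collapses into a single comparison chain with a bitwise & buried in the middle:

 import ast

 expr = "20000 <= A <= 30000 & 20 <= B <= 30"
 print(ast.dump(ast.parse(expr, mode="eval")))
 # The dump shows one long Compare chain whose middle term is
 # BinOp(..., op=BitAnd(), ...), i.e. the expression is read as
 #     20000 <= A <= (30000 & 20) <= B <= 30
 # rather than two separate range tests joined by &.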

 In [102]: result3 = mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()

This is your original selection, written with the MultiIndex slicers.

 In [103]: result1 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A, test_B-eps_B:test_B+eps_B, test_C-eps_C:test_C+eps_C, test_D-eps_D:test_D+eps_D], :]

Here is a chained version of the same selection. In other words, it re-selects on the result set one index level at a time.

 In [104]: result2 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A], :].loc[idx[:, test_B-eps_B:test_B+eps_B], :].loc[idx[:, :, test_C-eps_C:test_C+eps_C], :].loc[idx[:, :, :, test_D-eps_D:test_D+eps_D], :]

Always check the correctness before working on performance.

 In [109]: (result1 == result2).all().all()
 Out[109]: True

 In [110]: (result1 == result3).all().all()
 Out[110]: True

Performance

.query IMHO will actually scale very well and can use multiple cores. For big selections this will be the way to go.

 In [107]: %timeit mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()
 10 loops, best of 3: 107 ms per loop

2) The original multi-index slicing. There are issues here, see below; I am not sure exactly why it is this slow, and it is worth investigating separately.

 In [106]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A, test_B-eps_B:test_B+eps_B, test_C-eps_C:test_C+eps_C, test_D-eps_D:test_D+eps_D], :]
 1 loops, best of 3: 4.34 s per loop

Repeated selection makes this quite performant. Note that I would normally recommend against chained indexing like this, since you cannot assign through it, but for pure selection it is fine.

 In [105]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A], :].loc[idx[:, test_B-eps_B:test_B+eps_B], :].loc[idx[:, :, test_C-eps_C:test_C+eps_C], :].loc[idx[:, :, :, test_D-eps_D:test_D+eps_D], :]
 10 loops, best of 3: 140 ms per loop
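As for the numpy.searchsorted idea raised in the question, here is a rough sketch of one way it could be wired up (an assumption about the approach, not something benchmarked above): sort once by the first query column, binary-search its bounds, and boolean-mask the remaining columns inside that window. The mdt, test_* and eps_* names are the ones defined in the question.

 import numpy as np

 # One-time cost: sort by A so searchsorted can binary-search it.
 mdt_sorted = mdt.sort_values('A').reset_index(drop=True)
 a_values = mdt_sorted['A'].values

 # Per query: locate the A window, then mask B, C, D inside it.
 lo = np.searchsorted(a_values, test_A - eps_A, side='left')
 hi = np.searchsorted(a_values, test_A + eps_A, side='right')
 window = mdt_sorted.iloc[lo:hi]
 result4 = window[window['B'].between(test_B - eps_B, test_B + eps_B)
                  & window['C'].between(test_C - eps_C, test_C + eps_C)
                  & window['D'].between(test_D - eps_D, test_D + eps_D)]

How well this works depends on how selective the A bound is; A here takes only 35 distinct values, so the window can still be large, and this is a sketch of the idea rather than a recommendation.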