Pandas: speed up df.loc based on re-index values

I have a pandas DataFrame

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'x': ['a', 'b', 'c'],
                       'y': [1, 2, 2],
                       'z': ['f', 's', 's']}).set_index('x')

from which I would like to select rows based on index ( x ) values in the selection list

 selection = ['a', 'c', 'b', 'b', 'c', 'a'] 

The correct output can be obtained using df.loc as follows

 out = df.loc[selection] 

The problem I'm working with is that df.loc runs quite slowly on large DataFrames (2-7 million rows). Is there any way to speed up this operation? I looked at eval() , but it does not seem to apply to hard-coded lists of index values like this. I also thought about using pd.DataFrame.isin , but it skips the repeated values (it only returns one row per unique element of selection ).
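To make the isin limitation concrete, here is a small sketch on the toy frame from the question (variable names match the question; the counts are what boolean masking implies, since a mask can select each row at most once):

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'],
                   'y': [1, 2, 2],
                   'z': ['f', 's', 's']}).set_index('x')

selection = ['a', 'c', 'b', 'b', 'c', 'a']

# isin yields a boolean mask over the index, so each matching row
# appears at most once and the order of `selection` is lost.
subset = df[df.index.isin(selection)]
print(len(subset))            # 3, not 6
print(subset.index.tolist())  # ['a', 'b', 'c']
```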

2 answers

You can get decent acceleration using reindex instead of loc :

 df.reindex(selection) 
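A quick sanity check on the question's toy frame (a sketch; when every requested label exists in the index, reindex and loc should agree):

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'],
                   'y': [1, 2, 2],
                   'z': ['f', 's', 's']}).set_index('x')

selection = ['a', 'c', 'b', 'b', 'c', 'a']

# reindex returns rows in the order of `selection`, duplicates included,
# and matches df.loc[selection] when all labels are present.
out = df.reindex(selection)
print(out.index.tolist())  # ['a', 'c', 'b', 'b', 'c', 'a']
assert out.equals(df.loc[selection])
```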

Timing (version 0.17.0):

    >>> selection2 = selection * 100  # a larger list of labels
    >>> %timeit df.loc[selection2]
    100 loops, best of 3: 2.54 ms per loop
    >>> %timeit df.reindex(selection2)
    1000 loops, best of 3: 833 µs per loop

The two methods take different paths (hence the speed difference).

loc creates a new DataFrame by calling get_indexer_non_unique , which is necessarily more complex than the plain get_indexer (used for unique values).

On the other hand, the heavy lifting in reindex is done by the take_* functions in generated.pyx . These functions are fast because they are written for the single purpose of creating a new DataFrame.
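One caveat worth keeping in mind (not raised in the question, which has a unique index): reindex requires the existing index to be unique, and raises a ValueError on duplicate labels, in which case loc or a merge is the fallback. A minimal sketch:

```python
import pandas as pd

# A frame whose index contains duplicate labels.
dup = pd.DataFrame({'y': [1, 2, 3]}, index=['a', 'a', 'b'])

# reindex cannot decide which 'a' row a requested 'a' should map to,
# so it raises ValueError instead of guessing.
try:
    dup.reindex(['a', 'b'])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```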


You can try merge :

    df = pd.DataFrame({'x': ['a', 'b', 'c'],
                       'y': [1, 2, 2],
                       'z': ['f', 's', 's']})
    df1 = pd.DataFrame({'x': selection})

    In [21]: pd.merge(df1, df, on='x', how='left')
    Out[21]:
       x  y  z
    0  a  1  f
    1  c  2  s
    2  b  2  s
    3  b  2  s
    4  c  2  s
    5  a  1  f
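Note that the merge result carries a fresh integer index rather than x . If the output should look exactly like df.loc[selection] , a set_index('x') at the end restores it; a left merge preserves the order (and duplicates) of the left keys, so the row order matches selection (a sketch using the question's data):

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'],
                   'y': [1, 2, 2],
                   'z': ['f', 's', 's']})
selection = ['a', 'c', 'b', 'b', 'c', 'a']

# how='left' keeps the left frame's key order and duplicates;
# set_index('x') restores the index shape of df.loc[selection].
out = (pd.DataFrame({'x': selection})
         .merge(df, on='x', how='left')
         .set_index('x'))
print(out.index.tolist())  # ['a', 'c', 'b', 'b', 'c', 'a']
print(out['y'].tolist())   # [1, 2, 2, 2, 2, 1]
```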

Source: https://habr.com/ru/post/1234556/
