Pandas: speed up df.loc based on re-index values

I have a pandas DataFrame

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'x': ['a', 'b', 'c'],
                       'y': [1, 2, 2],
                       'z': ['f', 's', 's']}).set_index('x')

from which I would like to select rows based on index ( x ) values in the selection list

 selection = ['a', 'c', 'b', 'b', 'c', 'a'] 

The correct output can be obtained using df.loc as follows

 out = df.loc[selection] 

The problem I'm working with is that df.loc runs quite slowly on large DataFrames (2-7 million rows). Is there any way to speed up this operation? I looked at eval() , but it does not seem to apply to hard-coded lists of index values like this. I also thought about using pd.DataFrame.isin , but it skips the repeated values (it only returns one row per unique element of selection ).
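To make the isin limitation concrete, here is a small sketch on the toy frame from the question (variable names match the question; the counts are what boolean masking implies, since a mask can select each row at most once):

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'],
                   'y': [1, 2, 2],
                   'z': ['f', 's', 's']}).set_index('x')

selection = ['a', 'c', 'b', 'b', 'c', 'a']

# isin yields a boolean mask over the index, so each matching row
# appears at most once and the order of `selection` is lost.
subset = df[df.index.isin(selection)]
print(len(subset))            # 3, not 6
print(subset.index.tolist())  # ['a', 'b', 'c']
```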

2 answers

You can get decent acceleration using reindex instead of loc :

 df.reindex(selection) 
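A quick sanity check on the question's toy frame (a sketch; when every requested label exists in the index, reindex and loc should agree):

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'],
                   'y': [1, 2, 2],
                   'z': ['f', 's', 's']}).set_index('x')

selection = ['a', 'c', 'b', 'b', 'c', 'a']

# reindex returns rows in the order of `selection`, duplicates included,
# and matches df.loc[selection] when all labels are present.
out = df.reindex(selection)
print(out.index.tolist())  # ['a', 'c', 'b', 'b', 'c', 'a']
assert out.equals(df.loc[selection])
```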

Timing (version 0.17.0):

    >>> selection2 = selection * 100  # a larger list of labels
    >>> %timeit df.loc[selection2]
    100 loops, best of 3: 2.54 ms per loop
    >>> %timeit df.reindex(selection2)
    1000 loops, best of 3: 833 µs per loop

The two methods take different paths (hence the speed difference).

loc creates a new DataFrame by calling get_indexer_non_unique , which is necessarily more complex than the plain get_indexer (used for unique values).

On the other hand, the heavy lifting in reindex is done by the take_* functions in generated.pyx . These functions are fast because they are written for the single purpose of creating a new DataFrame.
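One caveat worth keeping in mind (not raised in the question, which has a unique index): reindex requires the existing index to be unique, and raises a ValueError on duplicate labels, in which case loc or a merge is the fallback. A minimal sketch:

```python
import pandas as pd

# A frame whose index contains duplicate labels.
dup = pd.DataFrame({'y': [1, 2, 3]}, index=['a', 'a', 'b'])

# reindex cannot decide which 'a' row a requested 'a' should map to,
# so it raises ValueError instead of guessing.
try:
    dup.reindex(['a', 'b'])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```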


You can try merge :

    df = pd.DataFrame({'x': ['a', 'b', 'c'],
                       'y': [1, 2, 2],
                       'z': ['f', 's', 's']})
    df1 = pd.DataFrame({'x': selection})

    In [21]: pd.merge(df1, df, on='x', how='left')
    Out[21]:
       x  y  z
    0  a  1  f
    1  c  2  s
    2  b  2  s
    3  b  2  s
    4  c  2  s
    5  a  1  f
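Note that the merge result carries a fresh integer index rather than x . If the output should look exactly like df.loc[selection] , a set_index('x') at the end restores it; a left merge preserves the order (and duplicates) of the left keys, so the row order matches selection (a sketch using the question's data):

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'],
                   'y': [1, 2, 2],
                   'z': ['f', 's', 's']})
selection = ['a', 'c', 'b', 'b', 'c', 'a']

# how='left' keeps the left frame's key order and duplicates;
# set_index('x') restores the index shape of df.loc[selection].
out = (pd.DataFrame({'x': selection})
         .merge(df, on='x', how='left')
         .set_index('x'))
print(out.index.tolist())  # ['a', 'c', 'b', 'b', 'c', 'a']
print(out['y'].tolist())   # [1, 2, 2, 2, 2, 1]
```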

Source: https://habr.com/ru/post/1234556/
