Fast pandas filtering

I want to filter a pandas DataFrame, keeping only the rows whose name column value appears in a given list.

Here we have a DataFrame:

import pandas as pd

x = pd.DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'])

Now let's say that we have the list ['sam', 'ruby'], and we want to find all the rows in which the name is listed, and then sum up the scores.

The solution I have is the following:

total = 0
names = ['sam', 'ruby']
for name in names:
    identified = x[x['name'] == name]
    total = total + sum(identified['score'])

However, when the DataFrame gets extremely large and the list of names is also very long, this becomes very slow.

Is there a faster alternative?

thanks

+4
2 answers

Try using isin (thanks to DSM for suggesting loc instead of ix here):

In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])

In [79]: names = ['sam', 'ruby']

In [80]: x['name'].isin(names)
Out[80]: 
0     True
1     True
2    False
Name: name, dtype: bool

In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541

CT Zhu offers a faster alternative using np.in1d:

In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 µs per loop

In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 µs per loop
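For reference, a self-contained script (a sketch assuming pandas and NumPy are installed) that reproduces the comparison above on the repeated frame and confirms both approaches give the same total:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                 columns=['name', 'score'])
names = ['sam', 'ruby']

# Repeat the small frame 1000 times to simulate a larger DataFrame
y = pd.concat([x] * 1000, ignore_index=True)

# Boolean mask via the pandas method
total_isin = y.loc[y['name'].isin(names), 'score'].sum()

# Boolean mask via the NumPy set-membership function
total_in1d = y.loc[np.in1d(y['name'], names), 'score'].sum()

# Both select the same rows: (328 + 3213) per copy of the frame
assert total_isin == total_in1d == 3541 * 1000
```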
+13

In my case, with a very large DataFrame, moving the name column into the index first sped up the lookup by about 500%.

Here is the approach:

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
names = ['sam', 'ruby']

df_searchable = df.set_index('name')

df_searchable[df_searchable.index.isin(names)]
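To finish the original task with this index-based filter, sum the score column of the matched rows. A runnable sketch (the 500% figure above is the answerer's measurement, not re-verified here):

```python
import pandas as pd

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
names = ['sam', 'ruby']

# Move 'name' into the index so membership tests run on the index
df_searchable = df.set_index('name')

# Select the rows whose index label is in the list, then sum the scores
matched = df_searchable[df_searchable.index.isin(names)]
total = matched['score'].sum()  # 328 + 3213 = 3541
```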
0

Source: https://habr.com/ru/post/1526616/

