Why does .loc assignment speed in pandas depend on how the DataFrame was initialized? How to make MultiIndex .loc as fast as possible?

Question

I am trying to improve code performance. I am using Pandas 0.19.2 and Python 3.5.

I just realized that setting a whole set of values at once with .loc has a very different speed depending on how the dataframe was initialized.

Can someone explain why and tell me what is the best initialization? This would allow me to speed up my code.

Here is a toy example. I create "similar" dataframes.

    import pandas as pd
    import numpy as np

    ncols = 1000
    nlines = 1000

    columns = pd.MultiIndex.from_product([[0], [0], np.arange(ncols)])
    lines = pd.MultiIndex.from_product([[0], [0], np.arange(nlines)])

    # df has a multi-index
    df = pd.DataFrame(columns=columns, index=lines)

    # df2 has a mono-index and is initialized a certain way
    df2 = pd.DataFrame(columns=np.arange(ncols), index=np.arange(nlines))
    for i in range(ncols):
        df2[i] = i*np.arange(nlines)

    # df3 has a mono-index and is not initialized
    df3 = pd.DataFrame(columns=np.arange(ncols), index=np.arange(nlines))

    # df4 has a mono-index and is initialized another way compared to df2
    df4 = pd.DataFrame(columns=np.arange(ncols), index=np.arange(nlines))
    for i in range(ncols):
        df4[i] = i

Then I time them:

    %timeit df.loc[(0, 0, 0), (0, 0)] = 2*np.arange(ncols)
    1 loop, best of 3: 786 ms per loop
    The slowest run took 69.10 times longer than the fastest. This could mean that an intermediate result is being cached.

    %timeit df2.loc[0] = 2*np.arange(ncols)
    1000 loops, best of 3: 275 µs per loop

    %timeit df3.loc[0] = 2*np.arange(ncols)
    10 loops, best of 3: 31.4 ms per loop

    %timeit df4.loc[0] = 2*np.arange(ncols)
    10 loops, best of 3: 63.9 ms per loop

Did I do something wrong? Why is df2 so much faster than the others? In fact, in the multi-index case it is much faster to set the elements one by one using .at. I implemented this solution in my code, but I am not happy with it; I think there must be a better solution. I would prefer to keep my nice multi-index dataframes, but if I really need to switch to a mono-index, I will.

    def mod(df, arr, ncols):
        # Set the row one element at a time with .at
        for j in range(ncols):
            df.at[(0, 0, 0), (0, 0, j)] = arr[j]
        return df

    %timeit mod(df, np.arange(ncols), ncols)
    The slowest run took 10.44 times longer than the fastest. This could mean that an intermediate result is being cached.
    100 loops, best of 3: 14.6 ms per loop

performance python pandas

tk. Jan 20 '17 at 22:27

1 answer

John · Accepted Answer · 2017-01-21T05:08:58+0000

One difference I see here is that you have (effectively) initialized df2 and df4 with dtype=int64, but df and df3 with dtype=object. You could initialize df and df3 with empty float values like this:

    # df has a multi-index; np.empty takes the shape as (rows, columns)
    df = pd.DataFrame(np.empty([nlines, ncols]), columns=columns, index=lines)

    # df3 has a mono-index and is not initialized
    df3 = pd.DataFrame(np.empty([nlines, ncols]), columns=np.arange(ncols), index=np.arange(nlines))

You can also add dtype=int to initialize as integers rather than floats, but that doesn't seem to matter for speed.
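
The dtype difference described above can be checked directly. The following is a minimal sketch with small, made-up sizes; the names df_unset, df_float and df_int are hypothetical and only mirror the three initialization styles discussed here.

```python
import numpy as np
import pandas as pd

# Small illustrative sizes (the question uses 1000 x 1000).
ncols, nlines = 5, 4
columns = np.arange(ncols)
index = np.arange(nlines)

# No data passed: every column ends up with dtype=object.
df_unset = pd.DataFrame(columns=columns, index=index)

# Backed by an uninitialized float array: every column is float64.
df_float = pd.DataFrame(np.empty((nlines, ncols)), columns=columns, index=index)

# Filled with an integer scalar and an explicit integer dtype.
df_int = pd.DataFrame(0, columns=columns, index=index, dtype=int)

for name, frame in [("unset", df_unset), ("float", df_float), ("int", df_int)]:
    print(name, frame.dtypes.unique())
```

Inspecting `dtypes` before timing anything is a quick way to see whether a frame has silently fallen back to object columns.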

I get a much faster timing than you did for df4 (with no difference in the code), so that's a mystery to me. In any case, with the above changes to df and df3, the timings are close for df2 through df4, but unfortunately df is still pretty slow.

    %timeit df.loc[(0, 0, 0), (0, 0)] = 2*np.arange(ncols)
    1 loop, best of 3: 418 ms per loop

    %timeit df2.loc[:,0] = 2*np.arange(ncols)
    10000 loops, best of 3: 185 µs per loop

    %timeit df3.loc[0] = 2*np.arange(ncols)
    10000 loops, best of 3: 116 µs per loop

    %timeit df4.loc[:,0] = 2*np.arange(ncols)
    10000 loops, best of 3: 196 µs per loop
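
For anyone who wants to reproduce this kind of comparison outside IPython, here is a rough sketch using the standard-library timeit module. The frame names, sizes and repeat count are assumptions chosen for illustration, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the question's 1000 x 1000 so the script finishes quickly.
ncols, nlines = 200, 200
row = 2.0 * np.arange(ncols)

# Object-dtype frame (no data passed) versus float64 frame.
df_object = pd.DataFrame(columns=np.arange(ncols), index=np.arange(nlines))
df_float = pd.DataFrame(np.zeros((nlines, ncols)),
                        columns=np.arange(ncols), index=np.arange(nlines))


def set_row(frame):
    # Assign a whole row at once through .loc, as in the question.
    frame.loc[0] = row


t_object = timeit.timeit(lambda: set_row(df_object), number=20)
t_float = timeit.timeit(lambda: set_row(df_float), number=20)

print("object dtype: %.6f s for 20 assignments" % t_object)
print("float64 dtype: %.6f s for 20 assignments" % t_float)
```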

Edit to add:

As for your bigger problem with the multi-index, I don't know, but two thoughts:

1) Expanding on @ptrj's comment, I get a very fast timing for their suggestion (about the same as the simple index methods):

    %timeit df.loc[(0, 0, 0) ] = 2*np.arange(ncols)
    10000 loops, best of 3: 133 µs per loop

So again I'm getting a completely different timing from you (?). And FWIW, when you want an entire row with loc / iloc, it is recommended to use : rather than leaving the column reference blank:

    %timeit df.loc[(0, 0, 0), : ] = 2*np.arange(ncols)
    1000 loops, best of 3: 223 µs per loop

But, as you can see, this is a bit slower, so I don't know which way to suggest here. I guess you should do it as recommended by the documentation, but on the other hand this could be an important speed difference for you.

2) Alternatively, and this is rather brute-force-ish, you could just save your index / columns, reset the index / columns to be simple, and then set the index / columns back to multi. That said, this isn't really different from just working on df.values, and I suspect it is not that convenient for you.
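
The save / reset / restore idea in point 2 could look roughly like the sketch below. It is only an illustration under assumed small sizes; the names saved_index and saved_columns are hypothetical, and pd.RangeIndex is just one way to get a simple index.

```python
import numpy as np
import pandas as pd

# Small illustrative sizes (the question uses 1000 x 1000).
ncols, nlines = 5, 4
columns = pd.MultiIndex.from_product([[0], [0], np.arange(ncols)])
lines = pd.MultiIndex.from_product([[0], [0], np.arange(nlines)])
df = pd.DataFrame(np.zeros((nlines, ncols)), columns=columns, index=lines)

# 1. Keep references to the multi-level index and columns.
saved_index, saved_columns = df.index, df.columns

# 2. Temporarily swap in simple integer labels.
df.index = pd.RangeIndex(len(saved_index))
df.columns = pd.RangeIndex(len(saved_columns))

# 3. Do the assignment against the simple index.
df.loc[0] = 2.0 * np.arange(ncols)

# 4. Put the multi-level labels back.
df.index, df.columns = saved_index, saved_columns

print(df.loc[(0, 0, 0), :])
```

This relies on the row order staying unchanged between steps 2 and 4; anything that reorders or drops rows in between would make the restored labels point at the wrong data.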
