Pandas: why should appending floats and ints to a DataFrame be slower than if it is full of NaN?

I am taking data from a file that receives 5-second OHLCVT bars from Interactive Brokers via Sierra Chart.

Following advice in earlier posts, rather than appending every new row to the DataFrame, I build a DataFrame from a historical file and append 5,000 "empty" rows with the correct timestamps to it. I then write each new row over an empty one, filling in any rows whose timestamps had no data and updating the pointers.

It works well. Here are the current class and functions. My initial version created the 5,000 "empty" rows as NaNs (OHLCVxyz). Thinking it would be neater to start with the final data types, I converted the "empty" rows to zeros, with OHLC as floats and Vxyz as ints, using:

    dg.iloc[0:5000] = 0.0
    dg[['v', 'x', 'y', 'z']] = dg[['v', 'x', 'y', 'z']].astype('int')
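
For context, a minimal sketch of this pre-allocation approach might look like the following. The column names, sizes and the write pointer are illustrative assumptions, not the actual class linked above:

    import pandas as pd

    # Hypothetical column names: OHLC as floats, Vxyz as ints.
    float_cols = ['o', 'h', 'l', 'c']
    int_cols = ['v', 'x', 'y', 'z']

    # Pre-allocate 5,000 "empty" rows, then cast the volume-style columns to int.
    dg = pd.DataFrame(0.0, index=range(5000), columns=float_cols + int_cols)
    dg[int_cols] = dg[int_cols].astype('int64')

    end = 0                                     # pointer to the next "empty" row
    new_row = [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8]
    dg.iloc[end] = new_row                      # overwrite in place instead of appending
    end += 1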

This only happens once per 5,000 rows (about once a day for HSI). What surprised me was the effect on the read/write loops: they went from 0.8 ms to 3.4 ms per row. The only change was from NaNs to zeros.

This figure shows an initial run with the zero-filled frame (see the timing stats of about 0.0038) and then a run with the NaN-filled frame (timing stats about 0.0008).

Can anyone shed light on why writing into fields holding [0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0] adds so much time compared with fields holding [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN]?

Any thoughts on improving the code are also welcome. :)

thanks

EDIT +17 hours

Following questions from @BrenBarn, I built a simpler model that anyone could run without my data. In doing so I eliminated the question of whether NaN affects it: in this version I wrote 0.0s to both variants, and the difference was the same:

  • writing rows into a DataFrame with 8 float columns is about 10 times faster than into one with 4 float columns and 4 int64 columns.
  • in each case the row being written was [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8].
  • the writes are done 10,000 times with self.df.iloc[self.end] = datarow, incrementing self.end each time (a minimal sketch of this test follows the list).
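
A self-contained version of that test might look like the sketch below; the column names and the timing helper are my own stand-ins for the linked SierraFrame test code:

    import time
    import numpy as np
    import pandas as pd

    def time_writes(df, n=10000):
        """Write one row per iteration via .iloc and return the average seconds per write."""
        row = [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8]
        start = time.perf_counter()
        for i in range(n):
            df.iloc[i] = row
        return (time.perf_counter() - start) / n

    cols = list('abcdefgh')
    all_floats = pd.DataFrame(np.zeros((10000, 8)), columns=cols)

    mixed = pd.DataFrame(np.zeros((10000, 8)), columns=cols)
    mixed[['e', 'f', 'g', 'h']] = mixed[['e', 'f', 'g', 'h']].astype('int64')

    print('8 x float64:        ', time_writes(all_floats))
    print('4 x float64 + 4 int:', time_writes(mixed))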

So, unless I am mistaken (always possible), it seems that writing a row of data into a DataFrame with 4 float columns and 4 int columns takes 10 times as long. Is this a problem with pandas, or is it just to be expected?

Here is the test code and here is the output image

It may be significant that the DataFrame already holds 350,000 rows of 8 columns before I start writing to it. My initial tests, writing only 10 rows, showed no effect, so I need to go back and repeat them.

EDIT +10 minutes

No: I went back and created an initial DataFrame with only 10 rows, and the effect on the write loops did not change, so it is not the size of the original array/DataFrame. It is likely that in my earlier test I thought I had converted the columns to ints but had not; checking showed that the command I thought would do it did not work as expected.

    da = SierraFrame(range(10), np.zeros((10, 8)))
    da.extend_frame1()

EDIT and Possible Answer +35 minutes

In case this question is not answered in more detail:

At the moment my hypothesis is that the underlying code path for writing [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8] into a pre-allocated row of the DataFrame differs when the frame is all one dtype versus a mix of float and int columns. I just tested with all int64 columns and the average was 0.41 ms, versus 0.37 ms for all floats and 2.8 ms for a mixed frame; int8 columns took 0.39 ms. It seems the mixture limits pandas' ability to optimise the operation, so if efficiency is very important, the best option is probably to keep all columns of the same type (most likely float64).
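
If that hypothesis holds, one possible workaround (using the hypothetical column names from the sketch above) is to keep the live frame homogeneous during the fast loop and cast to the final dtypes only afterwards:

    # Keep the working frame all float64 while rows are being written ...
    work = dg.astype('float64')
    # ... fast writes into `work` happen here ...
    # ... and cast the int-style columns back once the session is over.
    final = work.copy()
    final[['v', 'x', 'y', 'z']] = final[['v', 'x', 'y', 'z']].astype('int64')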

Tests conducted on Linux x64 with Python 3.3.1

1 answer

As described in this blog post by the main author of pandas, a pandas DataFrame is internally made up of "blocks". A block is a group of columns that share the same data type, and each block is stored as a numpy array of that type. So if you have five int columns and five float columns, there will be one int block and one float block.
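
A quick way to see this grouping (without touching pandas internals, which are private and version dependent) is simply to look at the dtypes of a mixed frame, e.g.:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'a': np.zeros(3), 'b': np.zeros(3),        # float64 columns -> one block
        'c': np.zeros(3, dtype='int64'),           # int64 columns   -> a second block
        'd': np.zeros(3, dtype='int64'),
    })
    print(df.dtypes)
    # The block layout itself lives in a private attribute (e.g. df._mgr in recent
    # pandas, df._data in older versions) and is not part of the public API.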

Appending to a multi-type DataFrame requires appending to each of the underlying numpy arrays, and appending to a numpy array is slow because it requires creating a whole new array. So it makes sense that appending to a multi-type DataFrame is slow: if all the columns are one type, pandas only has to create one new numpy array, but if the types differ, it has to create several new numpy arrays.
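
The cost of "appending" to a numpy array is easy to demonstrate on its own; np.append always allocates a new array rather than growing the old one:

    import numpy as np

    a = np.zeros(5)
    b = np.append(a, 6.0)     # returns a brand-new, larger array
    print(a.size, b.size)     # 5 6 -- the original array is unchanged
    print(a is b)             # False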

It is true that keeping all the data the same type will speed this up. However, I would say the main conclusion is not "if efficiency is important, keep all your columns the same type". The conclusion is: if efficiency is important, do not try to append to your arrays/DataFrames.

This is just how numpy works. The slowest part of working with numpy arrays is creating them in the first place. They have a fixed size, and when you "append" to one, you really just create an entirely new one with the new size, which is slow. If you absolutely must append, you can try things like juggling the types to ease the pain a bit. But in the end you just have to accept that any time you try to append to a DataFrame (or a numpy array in general), you will likely suffer a significant performance hit.
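
One common alternative (not spelled out in the answer above, so treat it as a sketch) is to collect incoming rows in a plain Python list and build the DataFrame once at the end, rather than growing the frame row by row:

    import pandas as pd

    def incoming_rows():
        """Hypothetical stand-in for the real 5-second data feed."""
        for _ in range(10000):
            yield [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8]

    rows = []                          # appending to a Python list is cheap
    for row in incoming_rows():
        rows.append(row)

    # Build the DataFrame once, after all rows have arrived.
    df = pd.DataFrame(rows, columns=['o', 'h', 'l', 'c', 'v', 'x', 'y', 'z'])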

