I am taking data from a file that receives 5-second OHLCVT bars from Interactive Brokers via Sierra Chart.
Following the tips in earlier posts, rather than appending every new row to the DataFrame, I build the DataFrame from a historical file and append 5,000 "empty" rows with the correct timestamps to it. I then write each new bar over an empty row, filling in any rows whose timestamps get no data, and updating pointers.
It works well. Here are the current classes and functions. My initial version created the 5,000 rows as NaNs (OHLCVxyz). I thought it would be cleaner to start with the final data types, so I converted the "empty" rows to zeros, with OHLC as floats and V, x, y, z as ints, using:
    dg.iloc[0:5000] = 0.0
    dg[[v, x, y, z]] = dg[[v, x, y, z]].astype('int')
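For context, here is a stripped-down sketch of the pattern (the column names, timestamps and values below are illustrative only, not my real class):

    import numpy as np
    import pandas as pd

    # Illustrative column names; in my real frame OHLC are floats and V/x/y/z are ints.
    cols = ['O', 'H', 'L', 'C', 'V', 'x', 'y', 'z']

    # Pre-allocate 5,000 spare rows with the timestamps they will eventually hold.
    idx = pd.date_range('2013-06-03 09:15:00', periods=5000, freq='5S')
    dg = pd.DataFrame(np.nan, index=idx, columns=cols)

    # The "zeros" version: convert the spare rows to typed zeros instead of NaN.
    dg.iloc[0:5000] = 0.0
    dg[['V', 'x', 'y', 'z']] = dg[['V', 'x', 'y', 'z']].astype('int')

    # Each incoming 5-second bar is then written over the next spare row
    # and the end pointer is advanced.
    end = 0
    datarow = [23000.0, 23010.0, 22995.0, 23005.0, 1200, 3, 4, 5]
    dg.iloc[end] = datarow
    end += 1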
The zero-fill and int conversion happen only once per 5,000 rows (once a day for the HSI). What struck me was the effect on the read/write loops: they went from 0.8 ms to 3.4 ms per row. The only change was from NaNs to zeros.
This figure shows the initial run with a zero-filled frame (per-row times around 0.0038 s), followed by a run with a NaN-filled frame (around 0.0008 s).
Can anyone explain why it should take so much longer to write into fields holding [0, 0, 0, 0, 0, 0, 0, 0] than into fields holding [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN]?
Any thoughts on improving the code are also welcome. :)
thanks
EDIT +17 hours
Following questions from @BrenBarn, I built a simpler model that anyone without the data can run. In doing so I eliminated the question of whether NaNs are involved: in this version both frames are filled with 0.0, and the difference is the same:
- a frame with 8 float columns accepts new rows 10 times faster than a frame with 4 float columns and 4 int64 columns.
- in each case, the row being added was [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8]
- the row is added 10,000 times with self.df.iloc[self.end] = datarow, incrementing self.end each time.
So, unless I'm mistaken (always possible), adding a row of data to a DataFrame with 4 float columns and 4 int columns takes 10 times as long as adding it to a DataFrame whose 8 columns are all floats. Is this a problem with pandas, or is it just to be expected?
Here is the test code and here is the output image
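Since the test code is only linked above, here is a minimal standalone sketch of the benchmark I am describing (it leaves out my SierraFrame wrapper class; names and values are illustrative):

    import time
    import numpy as np
    import pandas as pd

    def time_row_writes(df, datarow, n=10000):
        """Write datarow over the next n spare rows via iloc; return ms per row."""
        end = 0
        t0 = time.time()
        for _ in range(n):
            df.iloc[end] = datarow
            end += 1
        return 1000.0 * (time.time() - t0) / n

    n = 10000
    datarow = [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8]

    # Case 1: all eight columns are float64.
    df_float = pd.DataFrame(np.zeros((n, 8)), columns=list('OHLCVxyz'))

    # Case 2: four float64 columns and four int64 columns.
    df_mixed = pd.DataFrame(np.zeros((n, 8)), columns=list('OHLCVxyz'))
    df_mixed[['V', 'x', 'y', 'z']] = df_mixed[['V', 'x', 'y', 'z']].astype('int64')

    print('all float64 : %.3f ms per row' % time_row_writes(df_float, datarow))
    print('mixed dtypes: %.3f ms per row' % time_row_writes(df_mixed, datarow))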
I think the fact that the frame has 350,000 rows x 8 columns before I start adding to it may be significant. My initial tests, adding only 10 rows, showed no effect, so I have to go back and repeat them.
EDIT +10 minutes
No. I went back and created an initial frame with only 10 rows, and the effect on the add loops did not change, so it was not the size of the original array/DataFrame. It is likely that in my earlier test I thought I had converted the columns to ints but never actually did; checking showed that the command I thought would do it was wrong:
    da = SierraFrame(range(10), np.zeros((10,8)))
    da.extend_frame1()
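For anyone else tripping over the same thing, the simplest check is to print the dtypes before and after the conversion you think you made:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.zeros((10, 8)), columns=list('OHLCVxyz'))
    print(df.dtypes)    # all float64 at this point

    df[['V', 'x', 'y', 'z']] = df[['V', 'x', 'y', 'z']].astype('int64')
    print(df.dtypes)    # V, x, y, z now show int64 - the state I thought I was testing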
EDIT and Possible Answer +35 minutes
In case this question doesn't get answered in more detail:
At the moment my hypothesis is that the machinery pandas uses to write [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8] into a spare row of the DataFrame differs depending on whether the frame is all one dtype or a mix of float and int columns. I just tested with all int64 columns: the average was 0.41 ms per row, versus 0.37 ms for all float64 and 2.8 ms for the mixed frame. Int8 columns took 0.39 ms. I believe the mixture limits pandas' ability to optimise the assignment, so if efficiency really matters, the best option is to build the DataFrame with all columns of the same type (probably float64).
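One way to see what I think is going on, if you don't mind poking at pandas internals (the attribute below is not public API and its name depends on the pandas version):

    import numpy as np
    import pandas as pd

    # One frame that is all float64, and one of the same shape with four
    # int64 columns, built from a dict so columns are grouped cleanly by dtype.
    df_float = pd.DataFrame(np.zeros((10, 8)), columns=list('OHLCVxyz'))
    df_mixed = pd.DataFrame({
        'O': np.zeros(10), 'H': np.zeros(10),
        'L': np.zeros(10), 'C': np.zeros(10),
        'V': np.zeros(10, dtype='int64'), 'x': np.zeros(10, dtype='int64'),
        'y': np.zeros(10, dtype='int64'), 'z': np.zeros(10, dtype='int64'),
    })

    # A DataFrame keeps its values in one internal block per dtype.
    # The all-float frame is a single contiguous block; the mixed frame holds
    # a float64 block and an int64 block, so writing one row has to touch
    # more than one block. '_data' is the BlockManager on the pandas I am
    # running; newer versions expose it as '_mgr' instead.
    print(len(df_float._data.blocks))   # 1 block
    print(len(df_mixed._data.blocks))   # 2 blocks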
Tests conducted on Linux x64 with Python 3.3.1