Decompressing and reading Dukascopy .bi5 files

Question


To cut a long story short, I need to open a .bi5 file and read its contents. The problem: I have tens of thousands of .bi5 files containing time series data that I need to decompress and process (read, load into pandas).

I ended up installing Python 3 (I normally use 2.7) specifically for the lzma library, because I ran into compilation nightmares with the lzma back-ports for Python 2.7. So I gave in and switched to Python 3, but without success. The problems are too numerous to list, and no one reads long questions!

I have included one of the .bi5 files. If someone could get it into a pandas DataFrame and show me how they did it, that would be ideal.

P.S. The file is only a few kilobytes, so it will download in a second. Thank you very much in advance.

(File) http://www.filedropper.com/13hticks

+5

python pandas binary csv lzma

ajsp Dec 16 '16 at 1:56


2 answers

Have you tried using numpy to parse the data before moving it into pandas? It may be a longer route, but it lets you manipulate and clean the data before you do the analysis in pandas, and the integration between the two is pretty simple.

0

dsapandora Dec 18 '16 at 4:36


ptrj · Accepted Answer · 2016-12-18T04:31:30+0000

The code below should do the trick. First, it opens the file and decompresses it with lzma, and then it uses struct to unpack the binary data.

    import lzma
    import struct
    import pandas as pd


    def bi5_to_df(filename, fmt):
        chunk_size = struct.calcsize(fmt)
        data = []
        with lzma.open(filename) as f:
            while True:
                chunk = f.read(chunk_size)
                if chunk:
                    data.append(struct.unpack(fmt, chunk))
                else:
                    break
        df = pd.DataFrame(data)
        return df

The most important thing is to know the correct format. I googled around, tried a few guesses, and found that '>3i2f' (or '>3I2f') works pretty well. (That is big-endian: 3 ints followed by 2 floats. Your suggestion, 'i4f', does not produce sensible floats, regardless of whether it is read as big- or little-endian.) For the struct format syntax, see the struct documentation.
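To make the format string concrete, here is a minimal sketch using only the standard `struct` module. The record values are synthetic (chosen for illustration, not read from a real file), and the names `FMT`, `raw`, `fields` and `wrong` are hypothetical. It packs one record with '>3I2f', reads it back, and then shows why a format such as '>i4f' is hard to spot by size alone: it describes the same number of bytes but decodes the price integers as floats.

```python
import struct

FMT = '>3I2f'  # '>' big-endian, '3I' three unsigned 32-bit ints, '2f' two 32-bit floats

# With '>' there is no padding, so each record is 3*4 + 2*4 = 20 bytes.
record_size = struct.calcsize(FMT)

# Build one synthetic record and decode it again.
raw = struct.pack(FMT, 3581, 131966, 131945, 1.5, 1.5)
fields = struct.unpack(FMT, raw)
print(record_size, fields)

# '>i4f' also describes 20 bytes, so the chunk size alone cannot expose the mistake,
# but the second field (an integer price) is then reinterpreted as a float.
wrong = struct.unpack('>i4f', raw)
print(wrong)  # the second and third values come out as tiny, meaningless floats
```

Because both formats consume 20 bytes per record, checking the decoded values for plausibility is the only way to tell them apart.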

    df = bi5_to_df('13h_ticks.bi5', '>3i2f')
    df.head()
    Out[177]:
          0       1       2     3     4
    0   210  110218  110216  1.87  1.12
    1   362  110219  110216  1.00  5.85
    2   875  110220  110217  1.00  1.12
    3  1408  110220  110218  1.50  1.00
    4  1884  110221  110219  3.94  1.00

Update

To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy , I compiled and ran test_read_bi5 from there. The first lines of output:

    time, bid, bid_vol, ask, ask_vol
    2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
    2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
    2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
    2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
    2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5

And bi5_to_df on the same input file gives:

    bi5_to_df('01h_ticks.bi5', '>3I2f').head()
    Out[295]:
          0       1       2     3    4
    0  3581  131966  131945  1.50  1.5
    1  5142  131964  131943  1.50  1.5
    2  5202  131964  131943  2.25  1.5
    3  5321  131964  131944  1.50  1.5
    4  5441  131964  131944  1.50  1.5

So everything seems to be fine (the ninety47 code just reorders the columns).
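Comparing the two outputs suggests a column mapping: column 0 looks like milliseconds since the start of the hour (3581 corresponds to 01:00:03.581), columns 1 and 2 look like ask and bid as integers, and columns 3 and 4 look like ask and bid volume. The sketch below applies that mapping. The helper name `label_ticks` is hypothetical, the hour start has to be supplied by the caller, and the divisor of 1000 is an assumption that matches this particular file (131966 versus 131.966); other instruments may need a different scale.

```python
import pandas as pd


def label_ticks(raw_df, hour_start, price_scale=1000.0):
    """Name and scale the five raw columns (hypothetical helper, mapping inferred above)."""
    df = raw_df.copy()
    df.columns = ['ms', 'ask', 'bid', 'ask_vol', 'bid_vol']
    # Column 0 is treated as a millisecond offset from the start of the hour.
    df['time'] = pd.Timestamp(hour_start) + pd.to_timedelta(df['ms'], unit='ms')
    # Assumed scale: integer prices divided by price_scale give quoted prices.
    df['ask'] = df['ask'] / price_scale
    df['bid'] = df['bid'] / price_scale
    # Same column order as the ninety47 output.
    return df.set_index('time')[['bid', 'bid_vol', 'ask', 'ask_vol']]


# First rows of the bi5_to_df output shown above, typed in by hand.
raw = pd.DataFrame([(3581, 131966, 131945, 1.50, 1.5),
                    (5142, 131964, 131943, 1.50, 1.5),
                    (5202, 131964, 131943, 2.25, 1.5)])
ticks = label_ticks(raw, '2012-12-03 01:00:00')
print(ticks)
```

With the real file, `raw` would simply be the frame returned by `bi5_to_df`.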

In addition, it is probably more appropriate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
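With tens of thousands of files, reading 20 bytes at a time may become slow. As a possible variation (not part of the original answer), the whole file can be decompressed once and decoded with `struct.iter_unpack`, here with the unsigned format '>3I2f'. The function name `read_bi5_records` is hypothetical, dropping a trailing partial record is an assumption about how to treat damaged files, and the demo file is synthetic.

```python
import lzma
import os
import struct
import tempfile


def read_bi5_records(path, fmt='>3I2f'):
    """Return every record in an lzma-compressed file as a list of tuples (sketch)."""
    with lzma.open(path) as f:
        payload = f.read()  # decompress the whole file in one go
    record_size = struct.calcsize(fmt)
    # Assumption: silently ignore a trailing partial record instead of raising.
    usable = len(payload) - len(payload) % record_size
    return list(struct.iter_unpack(fmt, payload[:usable]))


# Round-trip check on a synthetic file, so no real .bi5 file is needed.
rows = [(3581, 131966, 131945, 1.5, 1.5), (5142, 131964, 131943, 1.5, 1.5)]
blob = b''.join(struct.pack('>3I2f', *row) for row in rows)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'synthetic.bi5')
    with open(path, 'wb') as out:
        out.write(lzma.compress(blob))
    records = read_bi5_records(path)
print(records)
```

The resulting list of tuples can be passed to `pd.DataFrame` in the same way as the `data` list in `bi5_to_df`.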
