HDF5 Benefits: Organization, flexibility, interoperability
Some of the main advantages of HDF5 are its hierarchical structure (similar to folders/files), arbitrary optional metadata stored with each item, and its flexibility (e.g. compression). This organizational structure and metadata storage may sound trivial, but it's very useful in practice.
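For instance, here's a minimal sketch of what that organization looks like with h5py (the file, group, dataset, and attribute names are made up for illustration):

import numpy as np
import h5py

with h5py.File('organized.hdf5', 'w') as f:
    run = f.create_group('run_001')                  # groups act like folders
    run.attrs['instrument'] = 'example instrument'   # arbitrary metadata on the group
    dset = run.create_dataset('temperature', data=np.random.random(1000))
    dset.attrs['units'] = 'K'                        # arbitrary metadata on the dataset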
Another advantage of HDF is that datasets can be either fixed-size or resizable. Because of this, it's easy to append data to a large dataset without having to create an entirely new copy.
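As a hedged sketch of what appending looks like in h5py (the names are placeholders): create the dataset with an unlimited maxshape along the axis you want to grow, then resize it before writing the new rows.

import numpy as np
import h5py

with h5py.File('growable.hdf5', 'w') as f:
    # maxshape=(None, 100) lets the first axis grow without bound.
    # (Resizable datasets must be chunked, hence chunks=True.)
    dset = f.create_dataset('timeseries', shape=(0, 100), maxshape=(None, 100),
                            dtype='float64', chunks=True)

    new_rows = np.random.random((10, 100))
    old = dset.shape[0]
    dset.resize(old + new_rows.shape[0], axis=0)   # grow the dataset in place
    dset[old:] = new_rows                          # write only the new rows

Only the newly written chunks hit the disk; the existing data isn't copied.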
In addition, HDF5 is a standardized format with libraries available for almost any language, so sharing on-disk data between, say, Matlab, Fortran, R, C, and Python is very easy with HDF. (To be fair, it's not too hard with a big binary array either, as long as you know the C vs. Fortran ordering and the shape, dtype, etc. of the stored array.)
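For comparison, here's roughly what the plain-binary-array route looks like, assuming you already know the shape, dtype, and ordering (the file name and values are placeholders):

import numpy as np

# Writing just dumps the array's bytes -- the shape, dtype, and ordering
# live in your head (or in a separate header/spec).
arr = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
arr.tofile('raw_array.dat')

# Reading: you have to supply shape, dtype, and ordering yourself.
recovered = np.fromfile('raw_array.dat', dtype=np.float32).reshape((2, 3, 4), order='C')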
Advantage of HDF5 for a large array: Faster I/O of an arbitrary slice
Just as a TL/DR: For an ~8 GB 3D array, reading a "full" slice along any axis took ~20 seconds with a chunked HDF5 dataset, and 0.3 seconds (best case) to more than three hours (worst case) for a memmapped array of the same data.
Beyond the things listed above, there's another big advantage to a "chunked"* on-disk data format such as HDF5: reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, because the on-disk data is more contiguous on average.
* (HDF5 doesn't have to be a chunked data format. It supports chunking, but doesn't require it. In fact, the default for creating a dataset in h5py is not to chunk, if I recall correctly.)
Basically, with a chunked HDF dataset, your worst-case disk-read speed and your best-case disk-read speed for a given slice of your dataset will be fairly close (assuming you chose a reasonable chunk size or let the library choose one for you). With a simple binary array, the best case is faster, but the worst case is much worse.
One caveat: if you have an SSD, you likely won't notice a huge difference in read/write speed. With a regular hard drive, though, sequential reads are much, much faster than random reads (i.e. a regular hard drive has long seek times). HDF still has an advantage on an SSD, but that's due more to its other features (e.g. metadata, organization, etc.) than to raw speed.
First off, to clear up some confusion: accessing an h5py dataset returns an object that behaves similarly to a numpy array, but doesn't load the data into memory until it's sliced. (Similar to memmap, but not identical.) Have a look at the h5py documentation for more information.
Slicing the dataset will load a subset of the data into memory, but presumably you want to do something with it, in which case you'll need it in memory anyway.
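A quick sketch of that behavior (the file and dataset names here are placeholders):

import h5py

f = h5py.File('myfile.hdf5', 'r')
dset = f['some_dataset']   # no data read yet; this is just a handle to the dataset
subset = dset[0, :, :]     # only now is this one slice read from disk into a numpy array
f.close()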
If you want to do out-of-core computations, you can do so fairly easily for tabular data with pandas or pytables. It's possible with h5py too (which is nicer for big N-D arrays), but you have to drop down to a lower level and handle the iteration yourself.
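If you do handle the iteration yourself with h5py, it looks something like this sketch (the dataset name and block size are arbitrary): process one block of the dataset at a time, so only that block is ever in memory.

import h5py

def blockwise_sum(filename, dataset_name, block=100):
    """Sum a large 3D dataset without ever loading the whole thing into memory."""
    total = 0.0
    with h5py.File(filename, 'r') as f:
        dset = f[dataset_name]
        for start in range(0, dset.shape[0], block):
            # Only this block of rows is read from disk at a time.
            total += dset[start:start + block].sum()
    return total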
However, the future of out-of-core computation is Blaze. Have a look at it if you really want to go that route.
"unchunked" case
First off, consider a 3D C-ordered array written to disk (I'll simulate it by calling arr.ravel() and printing the result, to make things more visible):
In [1]: import numpy as np

In [2]: arr = np.arange(4*6*6).reshape(4,6,6)

In [3]: arr
Out[3]:
array([[[  0,   1,   2,   3,   4,   5],
        [  6,   7,   8,   9,  10,  11],
        [ 12,  13,  14,  15,  16,  17],
        [ 18,  19,  20,  21,  22,  23],
        [ 24,  25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34,  35]],

       [[ 36,  37,  38,  39,  40,  41],
        [ 42,  43,  44,  45,  46,  47],
        [ 48,  49,  50,  51,  52,  53],
        [ 54,  55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64,  65],
        [ 66,  67,  68,  69,  70,  71]],

       [[ 72,  73,  74,  75,  76,  77],
        [ 78,  79,  80,  81,  82,  83],
        [ 84,  85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94,  95],
        [ 96,  97,  98,  99, 100, 101],
        [102, 103, 104, 105, 106, 107]],

       [[108, 109, 110, 111, 112, 113],
        [114, 115, 116, 117, 118, 119],
        [120, 121, 122, 123, 124, 125],
        [126, 127, 128, 129, 130, 131],
        [132, 133, 134, 135, 136, 137],
        [138, 139, 140, 141, 142, 143]]])
The values would be stored on disk sequentially as shown on line 4 below. (Let's ignore filesystem details and fragmentation for the moment.)
In [4]: arr.ravel(order='C')
Out[4]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,
        12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,
        24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,
        36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,
        48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,
        60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,
        72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
        84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,
        96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107,
       108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
       120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
       132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])
In the best case, let's take a slice along the first axis. Notice that these are just the first 36 values of the array. This will be a very fast read! (one seek, one read)
In [5]: arr[0,:,:]
Out[5]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])
Similarly, the next slice along the first axis will just be the next 36 values. To read a full slice along this axis, we only need one seek operation. If all we're ever going to read is various slices along this axis, this is the perfect file structure.
However, consider the worst case scenario: a slice along the last axis.
In [6]: arr[:,:,0]
Out[6]:
array([[  0,   6,  12,  18,  24,  30],
       [ 36,  42,  48,  54,  60,  66],
       [ 72,  78,  84,  90,  96, 102],
       [108, 114, 120, 126, 132, 138]])
To read this slice in, we'd need 24 seeks and 24 reads, since all of the values are separated on disk. None of them are adjacent!
This may seem fairly minor, but as we get to larger and larger arrays, the number and size of the seek operations grows rapidly. For a biggish 3D array (~10 GB) stored this way and read via memmap, reading a full slice along the "worst" axis can easily take tens of minutes, even with modern hardware. Meanwhile, a slice along the best axis can take less than a second. For simplicity, I'm only showing "full" slices along a single axis, but the exact same thing happens with arbitrary slices of any subset of the data.
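To make the seek-counting concrete, here's a small sketch that just does the bookkeeping from the discussion above: it counts how many contiguous runs of on-disk offsets a slice of a C-ordered array touches (nothing HDF5-specific here).

import numpy as np

def contiguous_runs(shape, index):
    """How many contiguous runs of on-disk offsets does a slice of a C-ordered array touch?"""
    offsets = np.sort(np.arange(np.prod(shape)).reshape(shape)[index].ravel())
    return 1 + int(np.count_nonzero(np.diff(offsets) != 1))

print(contiguous_runs((4, 6, 6), np.s_[0, :, :]))   # 1  -> best case: one seek, one read
print(contiguous_runs((4, 6, 6), np.s_[:, :, 0]))   # 24 -> worst case: every value is separate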
Incidentally, there are several file formats that exploit this and basically store three copies of huge 3D arrays on disk: one in C-order, one in F-order, and one in between. (An example of this is Geoprobe's D3D format, though I'm not sure it's documented anywhere.) Who cares if the final file size is 4 TB, storage is cheap! The crazy thing about that is that because the main use case is extracting a single sub-slice in each direction, the reads you want to make are very, very fast. It works very well!
A simple "random" case
Say we store 2x2x2 “chunks” of a 3D array in the form of contiguous blocks on disk. In other words, something like:
nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))
chunked = np.hstack([arr[chunk].ravel() for chunk in slices])
And the data on disk would look like chunked:
array([  0,   1,   6,   7,  36,  37,  42,  43,
         2,   3,   8,   9,  38,  39,  44,  45,
         4,   5,  10,  11,  40,  41,  46,  47,
        12,  13,  18,  19,  48,  49,  54,  55,
        14,  15,  20,  21,  50,  51,  56,  57,
        16,  17,  22,  23,  52,  53,  58,  59,
        24,  25,  30,  31,  60,  61,  66,  67,
        26,  27,  32,  33,  62,  63,  68,  69,
        28,  29,  34,  35,  64,  65,  70,  71,
        72,  73,  78,  79, 108, 109, 114, 115,
        74,  75,  80,  81, 110, 111, 116, 117,
        76,  77,  82,  83, 112, 113, 118, 119,
        84,  85,  90,  91, 120, 121, 126, 127,
        86,  87,  92,  93, 122, 123, 128, 129,
        88,  89,  94,  95, 124, 125, 130, 131,
        96,  97, 102, 103, 132, 133, 138, 139,
        98,  99, 104, 105, 134, 135, 140, 141,
       100, 101, 106, 107, 136, 137, 142, 143])
And just to show that these are 2x2x2 blocks of arr, notice that the first 8 values of chunked are:
In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0,  1],
        [ 6,  7]],

       [[36, 37],
        [42, 43]]])
To read in any slice along an axis, we'd read in either 6 or 9 contiguous chunks (twice as much data as we need) and then keep only the portion we wanted. That's a worst-case maximum of 9 seeks vs. a maximum of 24 seeks for the non-chunked version. (But the best case is still 6 seeks vs. 1 for the memmapped array.) Because sequential reads are very fast compared to seeks, this significantly reduces the amount of time it takes to read an arbitrary subset into memory. And once again, this effect becomes larger with larger arrays.
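The same sort of bookkeeping for the chunked layout above, counting how many 2x2x2 chunks a slice intersects (again, just an illustration, not how HDF5 itself does it):

import numpy as np

def chunks_touched(shape, chunkshape, index):
    """How many chunks does a slice intersect, for a dataset split into regular chunks?"""
    # Label every element with the ID of the chunk it lives in,
    # then count the unique chunk IDs inside the slice.
    blocks = [np.arange(n) // c for n, c in zip(shape, chunkshape)]
    nchunks = [int(np.ceil(n / c)) for n, c in zip(shape, chunkshape)]
    chunk_id = np.ravel_multi_index(np.meshgrid(*blocks, indexing='ij'), nchunks)
    return len(np.unique(chunk_id[index]))

print(chunks_touched((4, 6, 6), (2, 2, 2), np.s_[0, :, :]))   # 9
print(chunks_touched((4, 6, 6), (2, 2, 2), np.s_[:, :, 0]))   # 6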
HDF5 takes this a few steps further. The chunks don't have to be stored contiguously, and they're indexed by a B-tree. Furthermore, they don't have to be the same size on disk, so compression can be applied to each chunk.
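Turning that per-chunk compression on in h5py is just an extra keyword; a minimal sketch (the gzip level is arbitrary):

import numpy as np
import h5py

data = np.zeros((100, 100, 100))   # highly compressible, purely for illustration
with h5py.File('compressed.hdf5', 'w') as f:
    # Each chunk is gzipped independently on its way to disk.
    f.create_dataset('volume', data=data, chunks=True,
                     compression='gzip', compression_opts=4)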
Chunked arrays with h5py
By default, h5py doesn't create chunked HDF files on disk (I think pytables does, by contrast). If you specify chunks=True when creating the dataset, however, you'll get a chunked array on disk.
As a quick, minimal example:
import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test.hdf', 'w') as outfile:
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
    dset.attrs['some key'] = 'Did you want some metadata?'
Note that chunks=True tells h5py to automatically choose a chunk size for us. If you know more about your most common use case, you can optimize the chunk size/shape by specifying a shape tuple (e.g. (2,2,2) in the simple example above). This lets you make reads along a particular axis more efficient, or optimize for reads/writes of a certain size.
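For example, here's a sketch of picking a chunk shape by hand and checking what was actually used (the dataset names and chunk shapes are arbitrary):

import numpy as np
import h5py

data = np.random.random((100, 100, 100))
with h5py.File('tuned.hdf', 'w') as f:
    auto = f.create_dataset('auto_chunked', data=data, chunks=True)             # let h5py pick
    manual = f.create_dataset('row_friendly', data=data, chunks=(10, 10, 100))  # whole rows along the last axis
    print(auto.chunks)     # whatever h5py chose automatically
    print(manual.chunks)   # (10, 10, 100)

With the (10, 10, 100) chunks, a read like data[i, j, :] touches a single chunk, while a full slice across the last axis touches many, so pick the shape to match the reads you actually do.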
I/O performance comparison
Just to emphasize the point, let's compare reading in slices from a chunked HDF5 dataset and from a large (~8 GB), Fortran-ordered 3D binary array containing the same exact data.
I've cleared all OS caches between each run, so we're seeing "cold" performance.
For each file type, we'll test reading in a "full" x-slice along the first axis and a "full" z-slice along the last axis. For the Fortran-ordered memmapped array, the "x" slice is the worst case and the "z" slice is the best case.
The code used is in a gist (including creating the HDF file). I can't easily share the data used here, but you could simulate it with an array of zeros of the same shape (621, 4991, 2600) and dtype np.uint8.
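I can't reproduce the gist here, but a hedged sketch of building a stand-in HDF5 file would look something like this (the path and dataset name match the script below; note that this writes ~8 GB to disk and takes a while):

import numpy as np
import h5py

shape = (621, 4991, 2600)
with h5py.File('/tmp/test.hdf5', 'w') as f:
    dset = f.create_dataset('seismic_volume', shape=shape, dtype=np.uint8, chunks=True)
    # Write one plane at a time so we never hold the full ~8 GB in memory.
    # (Explicitly writing the zeros also forces the chunks to actually be allocated on disk.)
    for i in range(shape[0]):
        dset[i, :, :] = np.zeros(shape[1:], dtype=np.uint8)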
chunked_hdf.py looks like this:
import sys
import h5py

def main():
    data = read()
    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    f = h5py.File('/tmp/test.hdf5', 'r')
    return f['seismic_volume']

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()
memmapped_array.py is similar, but has a bit more complexity to ensure the slices are actually loaded into memory (by default, another memmapped array would be returned, which wouldn't be an apples-to-apples comparison).
import numpy as np
import sys

def main():
    data = read()
    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    big_binary_filename = '/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol'
    shape = 621, 4991, 2600
    header_len = 3072

    data = np.memmap(filename=big_binary_filename, mode='r', offset=header_len,
                     order='F', shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()
First, let's look at the HDF performance:
jofer at cornbread in ~
$ sudo ./clear_cache.sh

jofer at cornbread in ~
$ time python chunked_hdf.py z
python chunked_hdf.py z  0.64s user 0.28s system 3% cpu 23.800 total

jofer at cornbread in ~
$ sudo ./clear_cache.sh

jofer at cornbread in ~
$ time python chunked_hdf.py x
python chunked_hdf.py x  0.12s user 0.30s system 1% cpu 21.856 total
A "full" x-slice and a "full" z-slice take about the same amount of time (~20 seconds). Considering this is an 8 GB array, that's not too bad. Most of that time is spent waiting on disk I/O (note the low CPU usage in the timings above).
And comparing this to the memmapped array times (which, remember, is Fortran-ordered: the "z" slice is the best case and the "x" slice is the worst case):
jofer at cornbread in ~
$ sudo ./clear_cache.sh

jofer at cornbread in ~
$ time python memmapped_array.py z
python memmapped_array.py z  0.07s user 0.04s system 28% cpu 0.385 total

jofer at cornbread in ~
$ sudo ./clear_cache.sh

jofer at cornbread in ~
$ time python memmapped_array.py x
python memmapped_array.py x  2.46s user 37.24s system 0% cpu 3:35:26.85 total
Yes, you read that right: 0.3 seconds for one slicing direction and ~3.5 hours for the other.
The time to slice in the x direction is considerably longer than the amount of time it would take to load the entire 8 GB array into memory and select the slice we wanted! (Again, this is a Fortran-ordered array. The x/z slice timings would be swapped for a C-ordered array.)
However, if we always wanted to take a slice along the best-case direction, the big binary array on disk is very good (~0.3 seconds!).
With a memmapped array, you're stuck with this I/O discrepancy (or perhaps anisotropy is a better term). With a chunked HDF dataset, however, you can choose the chunk size so that access is either uniform or optimized for a particular use case. That gives you a lot more flexibility.
Finally
Hopefully that helps clear up at least one part of your question, at any rate. HDF5 has many other advantages over "raw" memmaps, but I don't have room to expand on all of them here. Compression can speed some things up (the data I work with doesn't benefit much from compression, so I rarely use it), and OS-level caching often plays more nicely with HDF5 files than with raw memmaps. Beyond that, HDF5 is a really fantastic container format: it gives you a lot of flexibility in managing your data, and it can be used from more or less any programming language.
All in all, give it a try and see if it works well for your use case. I think you may be surprised.