Shuffle multiple HDF5 datasets in place

I have several HDF5 datasets stored in the same file, my_file.h5. These datasets have different shapes, but the same number of observations along the first dimension:

 features.shape = (1000000, 24, 7, 1)
 labels.shape = (1000000,)
 info.shape = (1000000, 4)

It is important that the info / labels rows stay correctly associated with each row of features, so I want to shuffle all these datasets with the same seed. Also, I would like to shuffle them without loading them completely into memory. Is this possible with numpy and h5py?

2 answers

Shuffling arrays on disk will be time consuming, since it means allocating new arrays in the hdf5 file and then copying all the rows in a different order. You can iterate over rows (or use chunks of rows) with PyTables or h5py if you don't want to load all the data into memory at once.
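A rough sketch of that row-by-row copy on disk, assuming a common shuffle order for all datasets (the helper name, chunk size, and dataset sizes here are illustrative, not from the question):

```python
# Sketch: copy rows into a new dataset in shuffled order, one chunk at
# a time, so only `chunk` rows are ever in memory.
import os
import tempfile
import numpy as np
import h5py

def shuffled_copy(f, name, order, chunk=100):
    """Write rows of f[name] in `order` into a new dataset name + '_shuffled'."""
    src = f[name]
    dst = f.create_dataset(name + "_shuffled", shape=src.shape, dtype=src.dtype)
    for start in range(0, len(order), chunk):
        idx = order[start:start + chunk]
        # h5py fancy indexing requires increasing indices, so gather
        # the rows one at a time instead.
        dst[start:start + chunk] = np.stack([src[i] for i in idx])

path = os.path.join(tempfile.mkdtemp(), "my_file.h5")
order = np.random.permutation(1000)        # same order for every dataset
with h5py.File(path, "w") as f:
    f.create_dataset("features", data=np.random.rand(1000, 4))
    f.create_dataset("labels", data=np.arange(1000))
    shuffled_copy(f, "features", order)
    shuffled_copy(f, "labels", order)      # rows stay aligned across datasets
```

Using the same `order` array for every dataset is what keeps features, labels and info connected after the shuffle.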

An alternative approach would be to keep your data as it is, and simply map new row numbers to old row numbers in a separate array (which you can fully load into RAM, since it will be only about 4 MB for arrays of your size). For example, to shuffle a numpy array x,

 import numpy as np

 x = np.random.rand(5)
 idx_map = np.arange(x.shape[0])
 np.random.shuffle(idx_map)

Then you can use numpy advanced indexing to access your shuffled data,

 x[idx_map[2]]  # equivalent to x_shuffled[2]
 x[idx_map]     # equivalent to x_shuffled[:], etc.

This will also work with arrays stored in hdf5. Of course, there will be some overhead compared to writing the shuffled arrays to disk, but it may be sufficient depending on your use case.
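A minimal sketch of the index-map idea against an on-disk dataset (the file and dataset names are made up for illustration; single-row reads like dset[i] are plain scalar indexing, which h5py supports without the fancy-indexing restrictions discussed below):

```python
# Shuffle an index map in RAM, then use it to read rows of an
# (unchanged) on-disk dataset in shuffled order.
import os
import tempfile
import numpy as np
import h5py

data = np.random.rand(1000, 4)
idx_map = np.arange(1000)
np.random.shuffle(idx_map)

path = os.path.join(tempfile.mkdtemp(), "demo.h5")   # throwaway file
with h5py.File(path, "w") as f:
    f.create_dataset("features", data=data)

with h5py.File(path, "r") as f:
    dset = f["features"]
    row = dset[idx_map[2]]   # row 2 of the "shuffled" view of the data
```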


Shuffling arrays like this in numpy is straightforward.

Create a large shuffle index (shuffle np.arange(1000000)) and index the arrays

 features = features[I, ...]
 labels = labels[I]
 info = info[I, :]

This is not an in-place operation. labels[I] is a copy of labels, not a slice or a view.
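The copy semantics are easy to verify with plain numpy:

```python
import numpy as np

labels = np.arange(5)
I = np.array([4, 3, 2, 1, 0])

shuffled = labels[I]   # advanced indexing returns a new array (a copy)
shuffled[0] = 99       # modifying the copy...
print(labels)          # ...leaves the original untouched: [0 1 2 3 4]
```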

Alternative

 features[I,...] = features 

looks on the surface like an in-place operation. I doubt that it is, in the C code. It has to be buffered, because the values of I are not guaranteed to be unique. In fact, there is a special ufunc .at method for unbuffered operations.
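The buffering difference is visible when I does contain duplicates: the buffered fancy-indexed assignment applies each index once, while np.add.at applies every occurrence.

```python
import numpy as np

a = np.zeros(3)
I = np.array([0, 0, 1])

a[I] += 1            # buffered: the duplicate index 0 is applied only once
print(a)             # [1. 1. 0.]

b = np.zeros(3)
np.add.at(b, I, 1)   # unbuffered: every occurrence of an index counts
print(b)             # [2. 1. 0.]
```

(For a shuffle the index is a permutation, so its values happen to be unique; the point is that numpy cannot assume that in general.)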

But look at what h5py says about this same "fancy indexing":

http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

labels[I] selection is implemented, but with limitations:

 - List selections may not be empty
 - Selection coordinates must be given in increasing order
 - Duplicate selections are ignored
 - Very long lists (> 1000 elements) may produce poor performance

Your shuffled I is, by definition, not in increasing order. And it is very long.

Also, I don't see anything about using this fancy indexing on the left-hand side, as in labels[I] = ...
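One possible workaround (my own sketch, not from the answers): sort the index before reading, which satisfies h5py's increasing-order requirement, then undo the sort in memory. Shown here with a plain numpy array standing in for the dataset:

```python
import numpy as np

data = np.arange(10) * 10        # stand-in for an h5py dataset
I = np.random.permutation(10)    # shuffled index: not increasing

order = np.argsort(I)            # positions that sort I
J = I[order]                     # J is increasing, so dset[J] is legal in h5py
buf = data[J]                    # could be a single h5py read
out = np.empty_like(buf)
out[order] = buf                 # undo the sort: out[m] == data[I[m]]

assert np.array_equal(out, data[I])
```

This reads each selected row exactly once from disk, at the cost of one extra in-memory permutation of the chunk.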


Source: https://habr.com/ru/post/989512/

