How to append data to one specific dataset in a hdf5 file with h5py

I am looking for a way to append data to an existing dataset inside a .h5 file using Python (h5py).

A brief introduction to my project: I am trying to train a CNN on medical image data. Because of the huge amount of data and the heavy memory usage during the conversion to NumPy arrays, I needed to split the "conversion" into batches: load and preprocess the first 100 medical images, save the NumPy arrays to an hdf5 file, then load the next 100 images and append them to the existing .h5 file, and so on.

Now I tried to save the first 100 converted NumPy arrays as follows:

    import h5py
    from LoadIPV import LoadIPV

    X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()

    with h5py.File('.\PreprocessedData.h5', 'w') as hf:
        hf.create_dataset("X_train", data=X_train_data, maxshape=(None, 512, 512, 9))
        hf.create_dataset("X_test", data=X_test_data, maxshape=(None, 512, 512, 9))
        hf.create_dataset("Y_train", data=Y_train_data, maxshape=(None, 512, 512, 1))
        hf.create_dataset("Y_test", data=Y_test_data, maxshape=(None, 512, 512, 1))

As you can see, the converted NumPy arrays are split into four different "groups", which are stored in four hdf5 datasets: [X_train, X_test, Y_train, Y_test]. The LoadIPV() function performs the preprocessing of the medical image data.

My problem is that I would like to store the next 100 NumPy arrays in the same .h5 file, in the existing datasets: that is, I would like to extend, for example, the existing X_train dataset of shape [100, 512, 512, 9] with the next 100 arrays, so that X_train ends up with shape [200, 512, 512, 9]. The same should work for the other three datasets X_test, Y_train and Y_test.
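In NumPy terms, the result I am after is the equivalent of concatenating along axis 0, just performed on disk instead of in memory (a toy sketch with small stand-in shapes, not my real data):

```python
import numpy as np

# Toy shapes standing in for the real (100, 512, 512, 9) batches.
first_batch = np.zeros((100, 8, 8, 9))
next_batch = np.ones((100, 8, 8, 9))

# Stacking along axis 0 grows the first dimension: 100 + 100 -> 200.
combined = np.concatenate([first_batch, next_batch], axis=0)
print(combined.shape)  # (200, 8, 8, 9)
```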

2 answers

I found a solution that seems to work!

Take a look at this: incremental write to hdf5 with h5py !

To append data to a specific dataset, you first resize that dataset along the appropriate axis, and then write the new data into the newly added slice at the end of the "old" array.

So the solution looks like this:

    with h5py.File('.\PreprocessedData.h5', 'a') as hf:
        hf["X_train"].resize((hf["X_train"].shape[0] + X_train_data.shape[0]), axis=0)
        hf["X_train"][-X_train_data.shape[0]:] = X_train_data

        hf["X_test"].resize((hf["X_test"].shape[0] + X_test_data.shape[0]), axis=0)
        hf["X_test"][-X_test_data.shape[0]:] = X_test_data

        hf["Y_train"].resize((hf["Y_train"].shape[0] + Y_train_data.shape[0]), axis=0)
        hf["Y_train"][-Y_train_data.shape[0]:] = Y_train_data

        hf["Y_test"].resize((hf["Y_test"].shape[0] + Y_test_data.shape[0]), axis=0)
        hf["Y_test"][-Y_test_data.shape[0]:] = Y_test_data

However, note that the dataset must be created with a maxshape that has None on the axis you want to grow — for a 1-D dataset that is maxshape=(None,), for the 4-D arrays above it would be maxshape=(None, 512, 512, 9). For example:

    h5f.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True, maxshape=(None,))

otherwise the dataset cannot be resized.
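Putting both steps together, here is a minimal end-to-end sketch of the create-then-append cycle. The toy shapes and the file name "demo.h5" are placeholders, not part of the original setup:

```python
import h5py
import numpy as np

# Two toy batches standing in for two rounds of preprocessed images.
batch1 = np.zeros((100, 4, 4, 9), dtype=np.float32)
batch2 = np.ones((100, 4, 4, 9), dtype=np.float32)

# Step 1: create the dataset with None on axis 0 so it can grow later
# (h5py enables chunking automatically when maxshape is given).
with h5py.File("demo.h5", "w") as hf:
    hf.create_dataset("X_train", data=batch1, maxshape=(None, 4, 4, 9))

# Step 2: reopen in append mode, resize axis 0, write into the new slice.
with h5py.File("demo.h5", "a") as hf:
    ds = hf["X_train"]
    ds.resize(ds.shape[0] + batch2.shape[0], axis=0)  # 100 -> 200 rows
    ds[-batch2.shape[0]:] = batch2                    # fill the added rows

with h5py.File("demo.h5", "r") as hf:
    final_shape = hf["X_train"].shape
print(final_shape)  # (200, 4, 4, 9)
```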


Midas.Inc's answer works for appending to a dataset. Note that you may also need to update your version of h5py, otherwise you can get an error such as

 IOError: Unable to create file (file exists) 

Also, regarding this line:

    with h5py.File('.\PreprocessedData.h5', 'a') as hf:

as far as I know, ".\" is the Windows path convention; on Ubuntu "./" is expected instead.
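One way to sidestep the separator issue entirely is to build the path with os.path.join, which picks the right separator for the current OS (a small sketch; the file name is just the one from the question):

```python
import os

# os.path.join uses the platform's separator, so the same code works
# on Windows and Linux without hard-coding ".\" or "./".
path = os.path.join(".", "PreprocessedData.h5")
print(path)
```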

(I am posting this as an answer because I do not yet have enough reputation to comment.)


Source: https://habr.com/ru/post/1273076/
