How can I share HDF5 data among multiple Python processes?

I am trying to load HDF5 data from a memory cache (memcached) or over the network, and then access it (read-only) from several Python processes without each process making its own copy of the entire data set. Intuitively, I would like to mmap the file image (as it would appear on disk) into each process and then access it from Python.

I am finding this difficult to achieve, hence the question. Pointers / corrections appreciated.

Approaches I have looked into so far

  • pytables . This looks the most promising: it offers a rich interface for querying HDF5 data, and (unlike numpy) it appears to work with the data without making a process-local copy. It even has a File.get_file_image() method, which returns an image of the file. What I do not see is how to create a new File / FileNode from a memory image rather than from a disk file (though see the sketch after this list).
  • h5py . Another way to get at HDF5 data; as with pytables, it seems to require a file on disk. The driver='core' option looks promising, but I don't see how to hand it an existing mmap'd region rather than letting it allocate one of its own.
  • numpy . A lower-level approach: if I share my raw data via mmap, I might then construct a numpy ndarray on top of it. But the relevant constructor, ndarray.__new__(buffer=...), says it will copy the data, and numpy views apparently can only be built from existing ndarrays, not from raw buffers.
  • ctypes . A very low-level approach (perhaps with a little help from multiprocessing's Value wrapper). Using ctypes directly, I can read my mmap'd data without trouble, but I would lose all the structural information and the help of numpy / pandas / pytables for querying it.
  • Allocate disk space . I could just create a file, write out all the data, and then share it via pytables across all my processes. My understanding is that this would even be memory-efficient, since pytables does not copy (until required) and the processes would share the OS page cache of the underlying file image. My objection is that it is ugly and drags disk I/O into what I would like to be a purely in-memory system.
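For the pytables point above, a minimal sketch of what I mean, assuming PyTables 3.x, whose H5FD_CORE driver appears to accept a raw file image through the driver_core_image argument (the file name here is only a label and never touches disk):

```python
# A minimal sketch, assuming PyTables 3.x: the H5FD_CORE driver takes a
# raw file image via driver_core_image, so an image fetched from
# memcached / the network could be opened without any disk file.
import numpy as np
import tables

# Build an image entirely in memory (backing_store=0: nothing hits disk).
h5 = tables.open_file("img.h5", "w", driver="H5FD_CORE",
                      driver_core_backing_store=0)
h5.create_array(h5.root, "x", np.arange(10))
h5.flush()
image = h5.get_file_image()   # bytes you could put in memcached
h5.close()

# In a reader process: reopen read-only straight from the image.
h5r = tables.open_file("img.h5", "r", driver="H5FD_CORE",
                       driver_core_image=image,
                       driver_core_backing_store=0)
print(h5r.root.x[:])
h5r.close()
```

As far as I can tell, though, the core driver copies the image into its own buffer, so this addresses opening from memory rather than zero-copy sharing between processes.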
2 answers

mmap + the core driver with h5py for read-only in-memory access. I contributed a patch to h5py for working with file images some time ago, for scenarios just like this one. Unfortunately it was rejected, because upstream did not want to give users the ability to hold on to the buffer and manage it safely (via the buffer protocol introduced with Python 2.7), and doing that properly would have required changes on the HDF5 side that I never got to. Still, if this matters to you, and you are careful and able to build h5py yourself, take a look at the patch / pull request here
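Separately from that patch: h5py 2.9 and later can open a Python file-like object directly, which seems to cover the read-only-from-an-image case. A minimal sketch (the names here are illustrative):

```python
# A minimal sketch, assuming h5py >= 2.9 (which added support for
# opening Python file-like objects): read an HDF5 image from memory
# without any disk file. This is separate from the rejected patch.
import io
import h5py
import numpy as np

# Produce a small HDF5 image in memory for demonstration.
bio = io.BytesIO()
with h5py.File(bio, "w") as f:
    f["x"] = np.arange(10)
image = bio.getvalue()        # bytes as they might arrive from memcached

# Read-only access straight from the image.
with h5py.File(io.BytesIO(image), "r") as f:
    print(f["x"][:])
```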


I think this deserves an updated answer.

If a disk-backed file is acceptable, numpy now has a standard, dedicated ndarray subclass for this: numpy.memmap
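A minimal numpy.memmap sketch ("shared.dat" is an illustrative name); readers mapping the same file with mode="r" share pages through the OS page cache instead of copying:

```python
# A minimal sketch: writer and readers share pages of "shared.dat"
# (an illustrative name) through the OS page cache, not private copies.
import numpy as np

# Writer process: create and fill the file-backed array once.
data = np.memmap("shared.dat", dtype=np.float64, mode="w+", shape=(1000,))
data[:] = np.arange(1000)
data.flush()

# Any reader process: map the same file read-only.
view = np.memmap("shared.dat", dtype=np.float64, mode="r", shape=(1000,))
print(view[:5])
```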

UPDATE: Looking at the implementation of multiprocessing.sharedctypes (the CPython 3.6.2 shared-memory block allocation code), I found that it always creates tmp files to be mmap'ed, so it is not really a file-less solution.

If only pure RAM-based sharing is needed, someone has demonstrated it using multiprocessing.RawArray: a test of sharing numpy arrays over RawArray shared memory
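A minimal sketch of that RawArray approach, assuming a Unix "fork" start method so the child processes inherit the shared buffer:

```python
# A minimal sketch, assuming the Unix "fork" start method so the shared
# buffer is inherited by children; RawArray is unsynchronized, which is
# fine for read-only consumers.
import multiprocessing as mp
from multiprocessing.sharedctypes import RawArray
import numpy as np

N = 1000
shared = RawArray("d", N)                                  # shared C doubles
np.frombuffer(shared, dtype=np.float64)[:] = np.arange(N)  # fill once

def reader(i):
    # Zero-copy numpy view over the same shared block in each child.
    view = np.frombuffer(shared, dtype=np.float64)
    return float(view[i])

if __name__ == "__main__":
    with mp.get_context("fork").Pool(2) as pool:
        print(pool.map(reader, [0, 10, 999]))
```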


Source: https://habr.com/ru/post/1209919/

