Do xarray or dask really support memory mapping?

In my experiments, I tried:

  • xr.open_dataset with the chunks arg, and it loads the data into memory.
  • Setting up a NetCDF4DataStore and calling ds['field'].values, which loads the data into memory.
  • Setting up a ScipyDataStore with mmap='r'; ds['field'].values still loads the data into memory.

From what I have seen, the design does not seem to center on actually applying numpy functions to memory-mapped arrays, but rather on loading small chunks into memory (sometimes using memory mapping to do so). For example, see this comment. A somewhat related comment here notes that xarray cannot determine whether a numpy array is mmapped or not.

I would like to be able to represent and slice the data as an xarray.Dataset, and be able to call .values (or .data) to get an ndarray that remains mmapped (for shared-memory purposes and so on).

It would also be nice if chunked dask operations could at least operate on the memory-mapped array until something actually needs to be mutated, which seems possible, since dask seems to be designed around immutable arrays.

I did find a trick with xarray, though, which goes like this:

import numpy as np
import xarray as xr

data = np.load('file.npy', mmap_mode='r')  # read-only memory map
ds = xr.Dataset({'foo': (['dim1', 'dim2'], data)})

At this point, things like the following work without loading anything into memory:

np.sum(ds['foo'].values)
np.sum(ds['foo'][::2,:].values)

...xarray apparently does not know that the array is mmapped, and cannot afford to impose an np.copy for cases like these.
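One way to check the trick above, rather than relying on memory readings, is to ask numpy whether the array returned by .values still shares its buffer with the mapped file. This is only a minimal sketch: the scratch file and its contents are hypothetical stand-ins for 'file.npy', and the outcome may depend on the xarray version in use.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "file.npy")
np.save(path, np.arange(12, dtype=np.float64).reshape(3, 4))

# Read-only memory map, as in the question.
data = np.load(path, mmap_mode="r")
ds = xr.Dataset({"foo": (["dim1", "dim2"], data)})

values = ds["foo"].values
sliced = ds["foo"][::2, :].values

# np.shares_memory is True when no copy was made, i.e. the result
# still points at the memory-mapped buffer.
still_mapped = np.shares_memory(values, data)
slice_still_mapped = np.shares_memory(sliced, data)
total = float(np.sum(values))

print("values shares the mapped buffer:", still_mapped)
print("slice shares the mapped buffer:", slice_still_mapped)
print("sum:", total)
```

Note that the type of the returned array is not a reliable test here: the result may be a plain ndarray view onto the mapped buffer rather than an np.memmap instance, so comparing buffers says more than isinstance does.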

Is there a "real" way to do memmapping in xarray or dask?


xr.open_dataset with chunks= should not load the data into memory right away; it should create a dask.array, which is evaluated lazily.

testfile = '/Users/mdurant/data/smith_sandwell_topo_v8_2.nc'
arr = xr.open_dataset(testfile, chunks={'latitude': 6336//11, 'longitude': 10800//15}).ROSE
arr 

<xarray.DataArray 'ROSE' (latitude: 6336, longitude: 10800)>
dask.array</Users/mdurant/data/smith_sandwell_topo_v8_2.nc:/ROSE, shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>
Coordinates:
  * longitude  (longitude) float32 0.0166667 0.05 0.0833333 0.116667 0.15 ...
  * latitude   (latitude) float32 -72.0009 -71.9905 -71.9802 -71.9699 ...
Attributes:
    long_name:               Topography and Bathymetry ( 8123m -> -10799m)
    units:                   meters
    valid_range:             [-32766 32767]
    unpacked_missing_value:  -32767.0

(note the dask.array in the output)

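Many xarray operations on this array are lazy too and work chunk by chunk (and, if you slice, only the required chunks are loaded):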

arr.sum()

<xarray.DataArray 'ROSE' ()>
dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()>

arr.sum().values    # evaluates

Nothing is read or computed until the values are requested; only then are the chunks loaded and reduced.
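The lazy behaviour shown above can be reproduced without the topography file. The following is a rough sketch in which a small in-memory array stands in for the ROSE variable; the array contents, its name and the chunk sizes are assumptions made for illustration, and dask must be installed for .chunk() to work.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Small stand-in for the topography array; dimension names follow the answer.
arr = xr.DataArray(
    np.arange(24, dtype=np.float64).reshape(4, 6),
    dims=("latitude", "longitude"),
    name="demo",
).chunk({"latitude": 2, "longitude": 3})

# Building the reduction only records a task graph; no chunk is processed yet.
lazy_total = arr.sum()
is_lazy = isinstance(lazy_total.data, da.Array)

# Asking for .values is what triggers the actual computation.
total = float(lazy_total.values)

print("still lazy after sum():", is_lazy)
print("evaluated total:", total)
```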

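With the dask threaded scheduler, all workers are threads in a single process, so values held in memory are available to every worker without copying.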
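As a sketch of how this could be combined with the memory-mapped file from the question, a dask array can be built directly on top of a numpy memmap and evaluated with the threaded scheduler. The scratch file, its contents and the chunk sizes below are hypothetical, chosen only so the example runs on its own.

```python
import os
import tempfile

import dask.array as da
import numpy as np

# Hypothetical scratch file standing in for a large on-disk array.
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.ones((8, 6), dtype=np.float64))

# Read-only memory map; dask slices chunks out of it on demand.
mapped = np.load(path, mmap_mode="r")
darr = da.from_array(mapped, chunks=(4, 3))

# Lazy reduction over the chunks.
row_sums = darr.sum(axis=1)

# Threads live in one process, so every worker sees the same mapping.
result = row_sums.compute(scheduler="threads")

print(result.shape, float(result.sum()))
```

The final result is an ordinary in-memory numpy array; only the input stays backed by the file.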


Source: https://habr.com/ru/post/1680009/

