Python Dask - dataframe.map_partitions() return value

So dask.dataframe.map_partitions() accepts a func argument and a meta kwarg. How exactly does it determine its return type? As an example:

There are lots of csv files in ...\some_folder.

import dask.dataframe as dd
import numpy as np
import pandas as pd

ddf = dd.read_csv(r"...\some_folder\*", usecols=['ColA', 'ColB'],
                  blocksize=None,
                  dtype={'ColA': np.float32, 'ColB': np.float32})
example_func = lambda x: x.iloc[-1] / len(x)
metaResult = pd.Series({'ColA': .1234, 'ColB': .1234})
result = ddf.map_partitions(example_func, meta=metaResult).compute()

I am new to "distributed" computing, but I would intuitively expect this to return a collection (a list or dict, most likely) of Series objects. Instead, the result is a single Series object, which appears to be the concatenation of the example_func results from each partition. Even that would be sufficient if the Series had a MultiIndex indicating the partition label.

From what I can tell from this question, the docs, and the source code itself, is this because ddf.divisions returns (None, None, ..., None) as a result of reading csv files? Is there a dask-native way to handle this, or do I need to manually go in and break apart the returned Series (the concatenation of the Series that example_func returned for each partition) myself?
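To make the "manual" option concrete, this is roughly what I mean (a sketch that assumes every partition contributes a block of rows with the same index as metaResult):

n = len(metaResult)  # rows each partition contributes to the concatenated result
per_partition = [result.iloc[i * n:(i + 1) * n] for i in range(ddf.npartitions)]
# or, equivalently, attach the partition number as an extra index level
result.index = pd.MultiIndex.from_product([range(ddf.npartitions), metaResult.index])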

Also, feel free to correct my assumptions / practices here, as I'm new to dask.

1 answer

So dask.dataframe.map_partitions() accepts a func argument and a meta kwarg. How exactly does it determine its return type?

map_partitions applies func to each of the partitions that make up the Dask DataFrame, and the result is a lazy dask collection. Its type depends on what func returns (see the sketch below the list):

  • If func returns a scalar, map_partitions returns a lazy dask Series with one element per partition.
  • If func returns a pd.Series, map_partitions returns a lazy dask Series that is the concatenation of the pd.Series produced by func for each partition.
  • If func returns a pd.DataFrame, map_partitions returns a lazy dask DataFrame that is the concatenation of the pd.DataFrame objects produced by func for each partition.
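To illustrate the three cases, here is a minimal sketch using a toy two-partition frame built with dd.from_pandas (the toy data is made up for illustration):

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'ColA': [1.0, 2.0, 3.0, 4.0], 'ColB': [5.0, 6.0, 7.0, 8.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# scalar per partition -> dask Series with one element per partition
print(ddf.map_partitions(len).compute())

# pd.Series per partition -> concatenation of the per-partition Series
print(ddf.map_partitions(lambda df: df.mean()).compute())

# pd.DataFrame per partition -> dask DataFrame concatenating the pieces
print(ddf.map_partitions(lambda df: df * 2).compute())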

If you want the result of a single partition, you can use get_partition(). And yes, because ddf was created by reading csv files, the divisions are unknown. Since func in your example returns a pd.Series, the per-partition results are simply concatenated with nothing to tell them apart; if func returned a pd.DataFrame instead, you could include a column carrying the partition label.
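For instance, here is a minimal sketch of recovering per-partition results with the partition label you asked about; note that this computes the partitions one at a time rather than in parallel:

pieces = [example_func(ddf.get_partition(i).compute())
          for i in range(ddf.npartitions)]
labeled = pd.concat(pieces, keys=range(ddf.npartitions))
# labeled is a Series with a MultiIndex of (partition number, 'ColA'/'ColB')

If you want to keep the computation parallel, the same idea works with ddf.to_delayed(), wrapping example_func in dask.delayed and calling dask.compute on the delayed pieces.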


Source: https://habr.com/ru/post/1661100/

