Dask dataframe apply meta

I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta, I get the error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta, but I would like to know how to do it for future reference.

Dummy dataframe and the column frequencies:

import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                   ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                   [12, 10, 15, 23, 18, 20, 26]],
                  index=['Column A', 'Column B', 'Column C']).T
# older dask releases require npartitions (or chunksize) to be given explicitly
dask_df = dd.from_pandas(df, npartitions=1)

In [39]: dask_df.head()
Out[39]: 
  Column A Column B Column C
0      Sam      Sam       12
1     Alex    David       10
2    David    David       15
3    Sarah    Alice       23
4    Alice      Sam       18

(dask_df.groupby('Column B')
        .apply(lambda group: len(group))
       ).compute()

UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
Out[60]: 
Column B
Alice    2
David    2
Sam      3
dtype: int64

Attempting to define meta produces the AttributeError:

 (dask_df.groupby('Column B')
         .apply(lambda d: len(d), meta={'Column B': 'int'})).compute()

The same happens for this:

 (dask_df.groupby('Column B')
         .apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()

The same happens if I set the dtype to int instead of "int", or for that matter 'f8' or np.float64, so the dtype does not seem to be the problem.

The documentation on meta seems to imply that I should be doing exactly what I am trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata).

What is meta, and how am I supposed to define it?

I am using Python 3.6, dask 0.14.3 and pandas 0.20.2.

1 answer

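meta is the prescription of the names and types of the output from the computation. It is required because apply() is flexible enough that it can produce just about anything from a dataframe. As you can see, if you don't provide meta, dask actually computes part of the data to see what the types should be. That is fine, but you should know it is happening. You can avoid this pre-computation (which can be expensive) and be more explicit when you know what the output should look like, by providing a zero-row version of the output (dataframe or series), or just the types.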
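
A meta object can be an empty (zero-row) pandas object that carries only the name(s) and dtype(s) of the expected result. Here is a minimal sketch, using only pandas, of how such objects could be built. The column names 'x' and 'y' are borrowed from the warning message above and are placeholders, not columns of the example data.

```python
import pandas as pd

# Zero-row Series: describes a Series result by name and dtype only.
series_meta = pd.Series(dtype='int64', name='Column B')

# Zero-row DataFrame: describes a DataFrame result with two float columns.
# 'x' and 'y' are placeholder names taken from the warning text.
frame_meta = pd.DataFrame({'x': pd.Series(dtype='f8'),
                           'y': pd.Series(dtype='f8')})

# Neither object holds any data; they only carry structure.
print(len(series_meta), series_meta.name, series_meta.dtype)
print(len(frame_meta), list(frame_meta.columns))
```

Which of the two shapes to pass depends on what the applied function actually returns for each group.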

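The output of your computation is actually a series, not a dataframe, which is presumably why the dict and DataFrame forms of meta fail here. The following is the simplest version that works: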

(dask_df.groupby('Column B')
     .apply(len, meta=('int'))).compute()

A more precise version would be:

(dask_df.groupby('Column B')
     .apply(len, meta=pd.Series(dtype='int', name='Column B')))

Source: https://habr.com/ru/post/1678776/

