What is map_partitions?

The Dask API says that map_partitions can be used to "apply Python function on each DataFrame partition." From this description, and in line with the usual behavior of "map", I would expect the return value of map_partitions to be (something like) a list whose length equals the number of partitions. Each element of the list should be one of the return values of the function calls.

However, given the following code, I am not sure what the return value depends on:

# imports needed to run the example
import numpy as np
import pandas as pd
import dask.dataframe as dd

# generate example dataframe
pdf = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(pdf, npartitions=3)

# define helper function for map_partitions; VAL is the return value
VAL = pd.Series({'A': 1})
# VAL = pd.DataFrame({'A': [1]})  # other return values used in this example
# VAL = None
# VAL = 1
def helper(x):
    print('function called\n')
    return VAL

# check result
out = ddf.map_partitions(helper).compute()
print(len(out))
  • VAL = pd.Series({'A': 1}) results in 4 function calls (probably one to infer the output dtype and 3 for the partitions) and an output with len == 3 and type pd.Series.
  • VAL = pd.DataFrame({'A': [1]}) results in the same numbers; however, the resulting type is pd.DataFrame.
  • VAL = None results in a TypeError... why?
  • VAL = 1 results in 2 function calls. The result of map_partitions is 1.

My questions:

  • What does map_partitions actually do?
  • , / ?
  • , "" -, .. ?
  • , ?
Dask DataFrame.map_partitions returns a new Dask DataFrame or Series, depending on what the mapped function returns. See the API documentation for details.
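To make that concrete, here is a minimal sketch, not taken from the question. The helper `row_sums` and the deterministic example frame are hypothetical names and data chosen for illustration. It assumes the `meta` argument of map_partitions is used to declare the output structure up front, so Dask does not have to guess it by calling the function on dummy data. The `to_delayed()` step is one way to get the per-partition results that the question expected.

```python
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd

# Deterministic example frame (illustrative data, not the random frame from the question)
pdf = pd.DataFrame(np.arange(400, dtype="int64").reshape(100, 4), columns=list("ABCD"))
ddf = dd.from_pandas(pdf, npartitions=3)

def row_sums(part):
    # `part` is a plain pandas DataFrame holding one partition
    return part.sum(axis=1)

# meta=(name, dtype) declares that each call returns an int64 Series
lazy = ddf.map_partitions(row_sums, meta=(None, "int64"))

# `lazy` is still a Dask collection; nothing has been computed yet
print(type(lazy).__name__)

# compute() concatenates the per-partition results into one pandas object
result = lazy.compute()
print(type(result).__name__, len(result))

# One pandas Series per partition, closer to a "list of return values"
pieces = dask.compute(*lazy.to_delayed())
print(len(pieces))
```

So the length of the computed result is the total number of rows the function returned across all partitions, not the number of partitions. In the question, each call returned a one-row Series, which is why len(out) happened to be 3.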


Source: https://habr.com/ru/post/1661102/

