Dask DataFrame equivalent of pandas DataFrame sort_values

What is the equivalent of pandas' sort_values for a Dask DataFrame? I am trying to scale some pandas code that runs into memory problems by using a Dask DataFrame instead.

Would the equivalent be:

ddf.set_index([col1, col2], sorted=True)

?

1 answer

Sorting in parallel is hard. You have two options in dask.dataframe:

set_index

As of now, you can call set_index with a single column as the index, which sorts the data by that column:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: ddf.set_index('x').compute()
Out[5]: 
   y
x   
1  c
2  b
3  a

Unfortunately, dask.dataframe does not (as of November 2016) support multi-column indexes:

In [6]: ddf.set_index(['x', 'y']).compute()
NotImplementedError: Dask dataframe does not yet support multi-indexes.
You tried to index with this index: ['x', 'y']
Indexes must be single columns only.

nlargest

Given how you phrased your question, I suspect this does not apply to you, but many use cases that seem to need a full sort can get by with the much cheaper nlargest:

In [7]: ddf.x.nlargest(2).compute()
Out[7]: 
0    3
1    2
Name: x, dtype: int64

In [8]: ddf.nlargest(2, 'x').compute()
Out[8]: 
   x  y
0  3  a
1  2  b

Source: https://habr.com/ru/post/1694227/

