Distributing rows across partitions in a Dask DataFrame

Expectation: When I partition a dataframe, I expect the rows to be spread roughly evenly across the partitions. When I then write the dataframe to CSV, I expect the resulting n CSV files (10 in this case) to be of roughly equal length.

Reality: When I run the code below, the rows are not spread evenly at all: every row ends up in export_results-0.csv, and the other 9 CSV files are empty.

Question: Is there any additional configuration I need to set so that the rows are distributed across all partitions?

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

# Connect to the running Dask scheduler.
client = Client('tcp://10.0.0.60:8786')

df = pd.DataFrame({'geom': np.random.random(1000)}, index=np.arange(1000))
sd = dd.from_pandas(df, npartitions=100)

# Cross join via a constant key: every row shares key=0.
tall = dd.merge(sd.assign(key=0), sd.assign(key=0), on='key').drop('key', axis=1)
# to_csv computes by default, so no extra .compute() call is needed.
tall.to_csv('export_results-*.csv')

Context: the example above uses 1000 rows; the real dataset has 100 000 rows or more (100k+).

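In short, when both sides are Dask DataFrames, Dask shuffles rows by the hash of the join key, so with a single key value every row lands in the same Dask partition.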
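One way to see why a constant join key can leave every row in a single output file: a hash join sends each row to the partition given by the hash of its key modulo the number of partitions. The sketch below is plain Python, not Dask's actual implementation, and `assign_partitions` is a hypothetical helper used only for illustration.

```python
from collections import Counter


def assign_partitions(keys, npartitions):
    """Map each join key to a partition index, as a hash join would (illustrative only)."""
    return [hash(k) % npartitions for k in keys]


# Constant key, as in sd.assign(key=0): every row hashes to the same partition.
constant = Counter(assign_partitions([0] * 1000, 10))

# Distinct integer keys spread across all partitions.
varied = Counter(assign_partitions(range(1000), 10))

print(len(constant))  # 1 non-empty partition
print(len(varied))    # 10 non-empty partitions
```

With a single key value there is nothing for the hash to spread, no matter how many partitions are requested.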

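Dask can also merge a Dask DataFrame with a plain Pandas DataFrame; in that case each partition is joined separately, so no shuffle on the key is needed. For example: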

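import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.DataFrame({'geom': np.random.random(200)}, index=np.arange(200))
sd1 = dd.from_pandas(df1.copy(), npartitions=5).assign(key=0)

# The right-hand side is a plain Pandas DataFrame, not a Dask one.
tall = dd.merge(sd1, df1.assign(key=0), on='key', npartitions=10).drop('key', axis=1)
tall.to_csv('exported_csvs/res-*.csv')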

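That works. Keep in mind, though, that the Pandas side has to fit in memory, since it is a single in-memory table that is not partitioned by Dask.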
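The effect of joining a partitioned table against a single in-memory table can be sketched in plain Python. This is an illustration under simplifying assumptions (lists stand in for dataframes, and `cross_join` and `partitionwise_join` are hypothetical names), not how Dask implements the merge.

```python
def cross_join(left_rows, right_rows):
    """Cartesian product of two lists of rows."""
    return [(a, b) for a in left_rows for b in right_rows]


def partitionwise_join(left_partitions, right_rows):
    """Join each left partition with the whole in-memory right table."""
    return [cross_join(part, right_rows) for part in left_partitions]


rows = list(range(200))
# Five partitions of 40 rows each, mirroring npartitions=5 in the example above.
left_partitions = [rows[i:i + 40] for i in range(0, 200, 40)]

result = partitionwise_join(left_partitions, rows)
sizes = [len(part) for part in result]
print(sizes)  # [8000, 8000, 8000, 8000, 8000]
```

Because every left partition is processed on its own, the output has as many non-empty pieces as the left side had partitions, and each piece is the same size when the input partitions are.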

Source: https://habr.com/ru/post/1679492/

