Dask.dataframe: out of memory when merging and grouping

I am new to Dask and have some problems with it.

I am using a machine with 4 GB of RAM and 2 cores to analyze two CSV files (key.csv: ~2 million rows, about 300 MB; sig.csv: ~12 million rows, about 600 MB). Pandas cannot fit this data into memory, so I switched to Dask.dataframe. I expected Dask to process the data in small chunks that each fit in memory (slower is fine with me, as long as it works), but somehow Dask still uses up all the memory.

My code is as follows:

import dask.dataframe as dd

key = dd.read_csv("key.csv")
sig = dd.read_csv("sig.csv")
merge = dd.merge(key, sig,
                 left_on=["tag", "name"],
                 right_on=["key_tag", "query_name"],
                 how="inner")
merge.to_csv("test2903_*.csv")  # store the result on disk since it cannot fit in memory
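For reference, this is the kind of chunking I expected Dask to do on its own. As far as I know, read_csv also accepts an explicit blocksize, so the partitions can be made smaller (the "64MB" value below is just an example, not something from my actual run):

import dask.dataframe as dd

# Ask for smaller partitions than the default so each chunk stays well below the 4 GB of RAM.
key = dd.read_csv("key.csv", blocksize="64MB")
print(key.npartitions)  # number of chunks Dask will read and process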

Am I making a mistake somewhere? Any help is appreciated.
