Dask.dataframe: out of memory when merging and grouping

I am new to Dask and have some problems with it.

I am using a machine with 4 GB of RAM and 2 cores to analyze two CSV files (key.csv: ~2 million rows, about 300 MB; sig.csv: ~12 million rows, about 600 MB). Pandas cannot fit this data into memory, so I switched to Dask.dataframe. I expected Dask to process the data in small chunks that each fit in memory (slower is fine with me, as long as it works), but somehow Dask still uses up all the memory.

My code is as follows:

import dask.dataframe as dd

key = dd.read_csv("key.csv")
sig = dd.read_csv("sig.csv")
merge = dd.merge(key, sig,
                 left_on=["tag", "name"],
                 right_on=["key_tag", "query_name"],
                 how="inner")
merge.to_csv("test2903_*.csv")  # store the result on disk since it cannot fit in memory
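For reference, this is the kind of chunking I expected Dask to do on its own. As far as I know, read_csv also accepts an explicit blocksize, so the partitions can be made smaller (the "64MB" value below is just an example, not something from my actual run):

import dask.dataframe as dd

# Ask for smaller partitions than the default so each chunk stays well below the 4 GB of RAM.
key = dd.read_csv("key.csv", blocksize="64MB")
print(key.npartitions)  # number of chunks Dask will read and process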

Am I making a mistake somewhere? Any help is appreciated.
