I am new to Dask and have some problems with it.
I am using a machine with 4 GB of RAM and 2 cores to analyze two CSV files (key.csv: ~2 million rows, about 300 MB; sig.csv: ~12 million rows, about 600 MB). The data does not fit into memory with pandas, so I switched to dask.dataframe. I expected Dask to process the work in small chunks that fit in memory (it can be slower, I don't mind at all as long as it works), but somehow Dask still uses all of the memory.
My code is below:

import dask.dataframe as dd

key = dd.read_csv("key.csv")
sig = dd.read_csv("sig.csv")
merge = dd.merge(key, sig,
                 left_on=["tag", "name"],
                 right_on=["key_tag", "query_name"],
                 how="inner")
merge.to_csv("test2903_*.csv")
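I am not sure whether I should be controlling the partition size explicitly when reading, something like the sketch below (the 64MB blocksize is just a guess on my part at a value small enough for my RAM, not something I found in the docs):

# read with smaller partitions so each chunk stays well under the available RAM
# (64MB is an assumed value, not a recommendation from the documentation)
key = dd.read_csv("key.csv", blocksize="64MB")
sig = dd.read_csv("sig.csv", blocksize="64MB")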
Am I making a mistake somewhere? Any help is appreciated.