I am playing with some GitHub user data and trying to build a graph of all the people in a given city. To do this I need the merge operation in dask. Unfortunately, the GitHub dataset has roughly 6M records, and the merge appears to make the resulting frame explode in size. I used the following code:
import dask.dataframe as dd

# Read the (id, city) pairs twice so the frame can be merged with itself on city
gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()
st = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()

# Self-merge on city: every pair of users sharing a city becomes one row
mrg = gh.merge(st, on='city').drop('city', axis=1)

# Row-wise max/min of the two ids in each pair
mrg['max'] = mrg.max(axis=1)
mrg['min'] = mrg.min(axis=1)

mrg.to_castra('github')
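For context on the blow-up: a merge on a low-cardinality key like city is many-to-many, so n users sharing a city produce n² output rows. A tiny pandas illustration with made-up data (the frame and city name are hypothetical):

import pandas as pd

# Made-up data: four users in the same city
df = pd.DataFrame({'id': [1, 2, 3, 4], 'city': ['nyc'] * 4})

# The self-merge yields every pair: 4 rows in -> 16 rows out
pairs = df.merge(df, on='city')
print(len(pairs))  # 16

With 6M users concentrated in a handful of large cities, the merged frame can dwarf the inputs even though each input easily fits in 8 GB.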
Merging on other, more selective criteria, such as username to username, works with this code, but when I merge on city I get a MemoryError.
I have tried running this with the synchronous, multiprocessing, and threaded schedulers.
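For completeness, this is roughly how each scheduler was selected (shown with the scheduler keyword of recent dask releases; every variant ended in the same MemoryError):

# Picking each scheduler at compute time (recent dask API)
mrg.compute(scheduler='synchronous')  # single-threaded, easiest to debug
mrg.compute(scheduler='threads')      # shared-memory thread pool
mrg.compute(scheduler='processes')    # multiprocessing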
I am running this on a Dell laptop with a 4-core i7 and 8 GB of RAM. Shouldn't this operation run out-of-core, or am I doing something wrong? Is rewriting the code with pandas DataFrame iterators the only way out?
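By the pandas-iterator fallback I mean roughly this sketch (paths, keys, and the chunk size are placeholders; it assumes the HDF store is in table format, keeps one side of the merge in memory, and appends each chunk's pairs to disk):

import pandas as pd

# The (id, city) table is small, so one side of the merge can live in memory
cities = pd.read_hdf('data/github.hd5', '/github', columns=['id', 'city']).dropna()

# Stream the other side in chunks and write each chunk's pairs to disk,
# so only one chunk's worth of merged pairs is in memory at a time
for chunk in pd.read_hdf('data/github.hd5', '/github', chunksize=5000,
                         columns=['id', 'city']):
    pairs = chunk.dropna().merge(cities, on='city').drop('city', axis=1)
    pairs['max'] = pairs.max(axis=1)
    pairs['min'] = pairs.min(axis=1)
    pairs.to_hdf('github_pairs.h5', 'pairs', format='table', append=True)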