I am playing with some GitHub user data and trying to build a graph of all the people in a given city. To do this I need the merge operation in dask. Unfortunately, the GitHub dataset has roughly 6M records, and the merge appears to make the resulting frame explode in size. I used the following code:
import dask.dataframe as dd

# Read the (id, city) pairs twice so the frame can be merged with itself on city
gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()
st = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()

# Self-merge on city: every pair of users sharing a city becomes one row
mrg = gh.merge(st, on='city').drop('city', axis=1)

# Row-wise max/min of the two ids in each pair
mrg['max'] = mrg.max(axis=1)
mrg['min'] = mrg.min(axis=1)

mrg.to_castra('github')
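For context on the blow-up: a merge on a low-cardinality key like city is many-to-many, so n users sharing a city produce n² output rows. A tiny pandas illustration with made-up data (the frame and city name are hypothetical):

import pandas as pd

# Made-up data: four users in the same city
df = pd.DataFrame({'id': [1, 2, 3, 4], 'city': ['nyc'] * 4})

# The self-merge yields every pair: 4 rows in -> 16 rows out
pairs = df.merge(df, on='city')
print(len(pairs))  # 16

With 6M users concentrated in a handful of large cities, the merged frame can dwarf the inputs even though each input easily fits in 8 GB.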
Merging on other, more selective criteria, such as username to username, works with this code, but when I merge on city I get a MemoryError.
I have tried running this with the synchronous, multiprocessing, and threaded schedulers.
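For completeness, this is roughly how each scheduler was selected (shown with the scheduler keyword of recent dask releases; every variant ended in the same MemoryError):

# Picking each scheduler at compute time (recent dask API)
mrg.compute(scheduler='synchronous')  # single-threaded, easiest to debug
mrg.compute(scheduler='threads')      # shared-memory thread pool
mrg.compute(scheduler='processes')    # multiprocessing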
I am running this on a Dell laptop with a 4-core i7 and 8 GB of RAM. Shouldn't this operation run out-of-core, or am I doing something wrong? Is rewriting the code with pandas DataFrame iterators the only way out?
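By the pandas-iterator fallback I mean roughly this sketch (paths, keys, and the chunk size are placeholders; it assumes the HDF store is in table format, keeps one side of the merge in memory, and appends each chunk's pairs to disk):

import pandas as pd

# The (id, city) table is small, so one side of the merge can live in memory
cities = pd.read_hdf('data/github.hd5', '/github', columns=['id', 'city']).dropna()

# Stream the other side in chunks and write each chunk's pairs to disk,
# so only one chunk's worth of merged pairs is in memory at a time
for chunk in pd.read_hdf('data/github.hd5', '/github', chunksize=5000,
                         columns=['id', 'city']):
    pairs = chunk.dropna().merge(cities, on='city').drop('city', axis=1)
    pairs['max'] = pairs.max(axis=1)
    pairs['min'] = pairs.min(axis=1)
    pairs.to_hdf('github_pairs.h5', 'pairs', format='table', append=True)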