I am working on a problem with a larger-than-memory dataset. The original dataset is a CSV file. One of the columns contains identifiers of tracks from the MusicBrainz service.
What I have already done
I read the .csv file with dask and converted it to castra on disk for better performance. I also queried the MusicBrainz API and populated a SQLite database, via peewee, with some of the relevant results. I decided to use a database instead of another dask.dataframe because the process took several days and I did not want to lose the data in case of a failure.
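For concreteness, here is a minimal sketch of that pipeline. The file names, the track_id column, and the Track model are all hypothetical; it also assumes an older dask release that still ships DataFrame.to_castra, and the musicbrainzngs client for the API calls:

```python
import dask.dataframe as dd
import musicbrainzngs
from peewee import SqliteDatabase, Model, CharField

# Larger-than-memory CSV -> on-disk castra column store.
df = dd.read_csv('tracks.csv')
df.to_castra('tracks.castra')

# SQLite database for the MusicBrainz results, managed by peewee.
db = SqliteDatabase('musicbrainz.db')

class Track(Model):
    mbid = CharField(primary_key=True)   # MusicBrainz track identifier
    title = CharField(null=True)

    class Meta:
        database = db

db.connect()
db.create_tables([Track])

# Query the API once per unique identifier and persist each result,
# so a crash part-way through does not lose the finished work.
musicbrainzngs.set_useragent('track-analysis', '0.1')
for mbid in df['track_id'].unique().compute():
    recording = musicbrainzngs.get_recording_by_id(mbid)['recording']
    Track.create(mbid=mbid, title=recording.get('title'))
```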
I have not actually begun to analyze the data yet; I managed to make enough of a mess just munging it.
Current problem
I am finding it difficult to join the columns from the SQL database to the dask/castra dataframe. In fact, I am not sure whether this is feasible at all. What I have in mind is roughly the sketch below.
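This pulls the (comparatively small) SQL table into pandas and merges it into the larger dask dataframe. It assumes the SQL table fits in memory and reuses the hypothetical names from the sketch above:

```python
import sqlite3
import pandas as pd
import dask.dataframe as dd

# Read the SQLite table into an in-memory pandas frame.
conn = sqlite3.connect('musicbrainz.db')
meta = pd.read_sql('SELECT mbid, title FROM track', conn)

# Merge it into the larger-than-memory dask dataframe
# (in practice this would be loaded from the castra store).
df = dd.read_csv('tracks.csv')
joined = df.merge(meta, left_on='track_id', right_on='mbid', how='left')
```

Whether this plays well with a castra-backed dataframe is exactly what I am unsure about.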
Alternative approaches
It seems that I made some mistakes in choosing the tools for this task. Castra is probably not mature enough, and I think that is part of the problem. It might also have been better to choose SQLAlchemy over peewee, since SQLAlchemy is what pandas uses and peewee is not.
Blaze + HDF5 might be a good alternative to dask + castra, mainly because HDF5 is more stable, mature, and complete than castra, and blaze is less opinionated about data storage. For example, it could simplify merging the SQL DB into the main dataset.
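As a rough, hypothetical sketch of that route: odo could migrate the CSV into an HDF5 table, and blaze expressions could then query the HDF5 and SQLite sources uniformly (the URIs and names are again made up, and depending on the blaze version the entry point is spelled data() or Data()):

```python
import blaze as bz
from odo import odo

# Migrate the CSV into a table inside an HDF5 file.
odo('tracks.csv', 'tracks.h5::/tracks')

# Point blaze at both sources and express the join symbolically;
# nothing is computed until the expression is evaluated.
tracks = bz.data('tracks.h5::/tracks')
meta = bz.data('sqlite:///musicbrainz.db::track')
joined = bz.join(tracks, meta, on_left='track_id', on_right='mbid')
```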
On the other hand, I am familiar with pandas, and dask exposes the "same" API. With dask, I also get parallelism.
TL;DR
I have a larger-than-memory dataset plus a SQLite DB holding related data, and I need to join the two.
Right now I am using dask + castra (for the dask.dataframe) and would use SQLAlchemy to load the SQL DB into a pandas dataframe. Alternatively, I am considering blaze + HDF5.
What would you suggest?
Other / different suggestions are welcome as well.
This is my first question here; I hope it is not too long and is suitable for SO.