Choosing a data structure for larger-than-memory analysis with Python

I am working through a problem with a dataset that is larger than memory. The original dataset is a .csv file. One of the columns contains track identifiers from the MusicBrainz service.

What I have already done

I read the .csv file with dask and converted it to castra on disk for better performance. I also queried the MusicBrainz API and populated an sqlite database, using peewee, with some relevant results. I decided to use a database instead of another dask.dataframe because the process took several days and I did not want to lose the data in case of a failure.
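A minimal sketch of that setup (the castra conversion is left out, since castra's API was experimental); the file names and fields here are hypothetical:

```python
import dask.dataframe as dd
from peewee import SqliteDatabase, Model, CharField

# Lazily read the larger-than-memory CSV; dask splits it into partitions.
tracks = dd.read_csv('tracks.csv')  # hypothetical file name

# The sqlite database persists MusicBrainz API results as they arrive,
# so a crash cannot lose several days of downloaded data.
db = SqliteDatabase('musicbrainz.db')  # hypothetical file name

class Track(Model):
    mbid = CharField(primary_key=True)  # MusicBrainz track identifier
    title = CharField(null=True)        # invented example fields
    artist = CharField(null=True)

    class Meta:
        database = db

db.connect()
db.create_tables([Track], safe=True)  # safe=True: CREATE TABLE IF NOT EXISTS
```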

I have not actually started analyzing the data yet; I managed to make enough of a mess during the data wrangling alone.

Current problem

I am having a hard time joining the columns from the SQL database to the dask / castra dataframe. In fact, I am not sure whether this is viable at all.

Alternative approaches

It seems I made some mistakes in choosing the best tools for this task. Castra is probably not mature enough, and I think that is part of the problem. It might also have been better to choose SQLAlchemy over peewee, since SQLAlchemy is used by pandas and peewee is not.
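For what it is worth, pandas' SQL readers accept an SQLAlchemy engine directly; a sketch, reusing the hypothetical database and fields from above:

```python
import pandas as pd
from sqlalchemy import create_engine

# pandas accepts an SQLAlchemy engine (or connection string) directly.
engine = create_engine('sqlite:///musicbrainz.db')
meta = pd.read_sql('SELECT mbid, title, artist FROM track', engine)
```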

Blaze + HDF5 might be a good alternative to dask + castra, mainly because HDF5 is more stable / mature / complete than castra, and Blaze is less opinionated about data storage. For example, it could simplify joining the SQL DB into the main dataset.
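Even without Blaze, moving out of castra and into HDF5 is a one-off conversion with dask itself; a sketch under the same hypothetical file names:

```python
import dask.dataframe as dd

# Stream the CSV into an HDF5 store, partition by partition.
dd.read_csv('tracks.csv').to_hdf('tracks.h5', '/tracks')

# Later sessions read the HDF5 store directly, like pandas would.
tracks = dd.read_hdf('tracks.h5', '/tracks')
```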

On the other hand, I am familiar with pandas, and dask exposes the "same" API. With dask, I also get parallelism.
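Continuing the sketch, a pandas-style aggregation (the column names are invented) stays lazy and runs across partitions in parallel until `compute()` is called:

```python
# Identical syntax to pandas, but evaluated lazily and in parallel;
# compute() materializes the result as an ordinary pandas object.
plays = tracks.groupby('artist')['play_count'].sum().compute()
```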

TL;DR

Larger-than-memory dataset plus an sqlite DB that I need to join into the main dataset. Do I continue with dask + castra (I do not know of other data stores for dask.dataframe), using SQLAlchemy to load parts of the SQL DB into pandas dataframes in memory, or do I switch to Blaze + HDF5? Any suggestions?

Any other alternative / opinion is welcome. I hope this is specific enough for SO.


A couple of points:

  • It is cheap to move between HDF5 and CSV (in both directions), and dask.dataframe reads and writes HDF5 just as pandas does.

  • Yes, you can join a dask.dataframe with data from SQL.

The SQL table is the small side of the join: read it into memory and merge it into the dask.dataframe there; see the sketch below.
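A sketch of that join, under the assumptions above: the sqlite table fits in memory and the shared key is the hypothetical `mbid` column:

```python
import dask.dataframe as dd
import pandas as pd
from sqlalchemy import create_engine

tracks = dd.read_hdf('tracks.h5', '/tracks')  # the main dataset

# The MusicBrainz results are small relative to the main dataset,
# so read them fully into pandas.
engine = create_engine('sqlite:///musicbrainz.db')
meta = pd.read_sql('SELECT mbid, title, artist FROM track', engine)

# Merging a dask.dataframe with an in-memory pandas frame joins it
# against each partition, with no shuffle of the large dataset.
enriched = tracks.merge(meta, on='mbid', how='left')
```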


Source: https://habr.com/ru/post/1611643/

