What backing store should I use for my Python library?

I am writing a data-processing library in Python that reads data from various sources into memory, processes it, and then exports it to various formats. Currently I load everything into memory, but some of the data sets I process can be very large (more than 4 GB).

I need an open source library for the backing storage that works elegantly with large data sets. It has to allow changing the data structure dynamically (adding, renaming and deleting columns) while keeping iteration reasonably fast. Ideally it should handle strings and integers of arbitrary size (just like Python does), but I can constrain those in my own code if necessary. It should also be able to handle missing values.
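Just to make the requirements concrete, here is a rough, purely illustrative sketch (plain Python, not any particular library's API) of the operations I need the backing store to support:

    # Hypothetical, illustration only: a plain in-memory dict of columns --
    # exactly the thing that stops working once the data no longer fits in RAM.
    table = {
        "name":  ["alice", "bob"],          # strings of arbitrary length
        "count": [1, 10**30],               # integers of arbitrary size
    }

    table["score"] = [3.5, None]            # add a column (None marks a missing value)
    table["total"] = table.pop("count")     # rename a column
    del table["name"]                       # delete a column

    for i in range(len(table["total"])):    # reasonably fast row-wise iteration
        row = {col: values[i] for col, values in table.items()}
        print(row)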

Does anyone have any suggestions?

+4
4 answers

A document database should handle this kind of workload well if you don't have complex joins.

Typical representatives are CouchDB and MongoDB.

Both are well suited to MapReduce-style algorithms, i.e. iterating over the whole data set. If you want to merge rows with new data, you either want the "table" to be sorted or to have fast access to individual elements: both come down to an index.

Document databases support multiple "tables" by holding documents with different schemas, and they can easily query for documents with a specific schema.
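As a hedged illustration only (names and values are made up, and a local MongoDB server plus the pymongo driver are assumed), the dynamic-schema part might look roughly like this:

    # Sketch only: assumes a MongoDB server on localhost and the pymongo driver.
    from pymongo import MongoClient

    db = MongoClient()["mydata"]            # database name is arbitrary
    rows = db["rows"]                       # one "table" as a collection

    rows.insert_many([
        {"name": "alice", "count": 1},
        {"name": "bob", "count": 10, "note": None},   # missing/None values are fine
    ])

    # Schema changes are just updates over the documents:
    rows.update_many({}, {"$rename": {"count": "total"}})   # rename a column
    rows.update_many({}, {"$set": {"score": 0}})             # add a column
    rows.update_many({}, {"$unset": {"note": ""}})           # delete a column

    rows.create_index("total")              # index for fast access to single records

    for doc in rows.find():                 # iterate over the whole data set
        print(doc)                          # replace with your own processing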

I don't think you will find an easy solution for processing several 4 GB data sets with the requirements you have specified. Dynamic data structures in particular are hard to make fast.

+3

Give Metakit a look. It offers schema flexibility and has Python bindings. Although it hasn't really caught on, it has been around for a while.

+1

Another idea might be to use Hadoop as your backend. It has some similarity to the previously mentioned CouchDB, but focuses more on processing large data sets efficiently with MapReduce.

Compared to CouchDB, Hadoop is not suitable for real-time applications or as a database behind a website, because the latency for accessing a single record is high; but it really shines when iterating over all elements and running computations, even over petabytes of data.

So maybe you should give Hadoop a try. Of course, it may take some time to get used to writing MapReduce algorithms, but they are a really nice way to describe this kind of problem, and you do not have to deal with storing intermediate results yourself. A nice side effect is that your algorithm will keep working as your data set grows; you may just have to add another server. :-)
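As a rough sketch of the MapReduce style (assuming Hadoop Streaming and a made-up comma-separated input format, not any specific tutorial's code), a job is just two small Python scripts reading from stdin:

    # mapper.py -- Hadoop Streaming sketch; assumes "key,amount" input lines.
    import sys

    for line in sys.stdin:
        key, amount = line.rstrip("\n").split(",", 1)
        print(f"{key}\t{amount}")           # emit tab-separated key/value pairs for the shuffle

    # reducer.py -- Hadoop sorts the mapper output by key before handing it to the reducer.
    import sys
    from itertools import groupby

    def pairs(stream):
        for line in stream:
            key, value = line.rstrip("\n").split("\t", 1)
            yield key, value

    for key, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
        print(f"{key}\t{sum(int(v) for _, v in group)}")   # one aggregated line per key

You would then submit the job with the Hadoop Streaming jar, passing these scripts via the -mapper and -reducer options.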

There are also plenty of books and documentation available on Hadoop and MapReduce, and a good tutorial on using Hadoop from Python can help you get started.

+1

PyTables may be the answer for you. Although I suspect it is mainly used for numeric data, it may also fit your bill (judging by what I see on their home page).
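For illustration only, a minimal PyTables sketch (the file name, column names and the fixed-width string size are my assumptions, not your schema) could look like this:

    # Sketch only: assumes PyTables ("pip install tables") writing an HDF5 file on disk.
    import tables

    class Row(tables.IsDescription):
        name  = tables.StringCol(64)        # fixed-width strings, not arbitrary length
        total = tables.Int64Col()           # 64-bit integers, not arbitrary precision

    with tables.open_file("data.h5", mode="w") as h5:
        table = h5.create_table("/", "rows", Row)
        r = table.row
        for i in range(1000000):
            r["name"] = f"item{i}"
            r["total"] = i
            r.append()                      # rows are chunked to disk, not kept in RAM
        table.flush()

        for row in table.iterrows():        # fast out-of-core iteration
            pass                            # process row["name"], row["total"]

Note that, as far as I know, columns are fixed-type and fixed-width here, and adding or renaming columns usually means writing out a new table, so it may not fully cover the dynamic-schema requirement.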

0

Source: https://habr.com/ru/post/1301145/

