Parquet Nesting with Python

I have a file with one JSON object per line. Here is an example record (pretty-printed for readability):

{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}

I want to create a Parquet file with columns such as:

product.id, product.price, product.specs.voltage, product.specs.color, user
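
For illustration, here is a minimal sketch of one way to get those dotted column names by flattening each record before it reaches Parquet. It does not use Parquet's nested encoding. The `flatten` helper and the inline `lines` list are hypothetical stand-ins for real code and for the input file; only the standard library is used.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict whose keys are dotted paths."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical stand-in for the input file: one JSON object per line.
lines = [
    '{"product": {"id": "abcdef", "price": 19.99, '
    '"specs": {"voltage": "110v", "color": "white"}}, '
    '"user": "Daniel Severo"}',
]

rows = [flatten(json.loads(line)) for line in lines]
print(list(rows[0]))
```

The resulting list of flat dicts can be passed to `pandas.DataFrame(rows)` and written with `DataFrame.to_parquet`, which needs either pyarrow or fastparquet installed. `pandas.json_normalize` performs a similar flattening directly on a list of parsed records.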

I know that Parquet supports nested encoding using the Dremel algorithm, but I have not been able to use it from Python (I am not sure why).

I am a heavy user of pandas and dask, so the pipeline I'm trying to build is JSON data -> dask -> Parquet -> pandas. That said, if someone has a simple example of creating and reading these nested encodings in Parquet using Python, I think that would be good enough :D

EDIT

So, after digging through the pull requests, I found this: https://github.com/dask/fastparquet/pull/177

. , . dask/fastparquet, product ?

+4
1

, - Python. Arrow/parquet-cpp (. https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), ( / ). , , Parquet, Impala, Hive, Presto, Drill Spark, SQL, Python.

fastparquet, ( ), , .

, , ( -cpp) , , .

+4

Source: https://habr.com/ru/post/1682362/

