Parquet Nesting with Python

Question


I have a file with one JSON object per line. Here is an example (pretty-printed for readability):

{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}

I want to create a parquet file with columns such as:

product.id, product.price, product.specs.voltage, product.specs.color, user

I know that Parquet has a nested encoding based on the Dremel algorithm, but I have not been able to use it from Python (not sure why).

I am a heavy user of pandas and dask, so the pipeline I'm trying to build is JSON data -> dask -> Parquet -> pandas. That said, if someone has a simple example of creating and reading these nested encodings in Parquet using Python, I think that will be good enough :D

EDIT

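So, after digging through the PRs, I found this: https://github.com/dask/fastparquet/pull/177

which is basically what I want to do. However, I still can't make it work all the way through. How exactly do I tell dask/fastparquet that my product column is nested?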

dask : 0.15.1
fastparquet : 0.1.1

+4

json python dask parquet

Daniel Severo, asked Jul 27 '17 at 4:01

1

Wes McKinney · Accepted Answer · 2017-07-28T21:05:39+0000

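Implementing the conversions on both the read and write paths for arbitrary Parquet nested data is quite complicated to get right: it means implementing the shredding and reassembly algorithm, with the associated conversions to some Python data structures. We have this on the roadmap in Arrow/parquet-cpp (see https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), but it has not been completed yet (only simple structs and lists/arrays are supported now). It is important to have this functionality, because other systems that use Parquet, such as Impala, Hive, Presto, Drill and Spark, have native support for nested types in their SQL dialects, so we need to be able to read and write these structures faithfully from Python.

This can be implemented analogously in fastparquet as well, but it is going to be a lot of work (and test cases to write), no matter how you slice it.

I will likely take on the work (in parquet-cpp) personally later this year if no one beats me to it, but I would love to have some help.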
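For the flat column layout the question asks for (product.id, product.price, product.specs.voltage, product.specs.color, user), one possible workaround is to flatten each JSON record before it reaches pandas or dask, so that no nested Parquet encoding is needed at all. The following is only a minimal sketch using the standard library; `flatten` is a hypothetical helper name, not part of dask, fastparquet or pyarrow, and it assumes the nested values are plain JSON objects (lists are left untouched).

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict whose keys are dotted paths."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# One line of the input file, as in the question's example.
line = (
    '{"product": {"id": "abcdef", "price": 19.99, '
    '"specs": {"voltage": "110v", "color": "white"}}, '
    '"user": "Daniel Severo"}'
)

row = flatten(json.loads(line))
print(list(row))
# ['product.id', 'product.price', 'product.specs.voltage', 'product.specs.color', 'user']
```

Each flattened row is a plain dict, so a list of such rows can be passed to pandas.DataFrame and written out with DataFrame.to_parquet. Note that this sidesteps Parquet's nested encoding rather than using it: the file ends up with ordinary top-level columns whose names happen to contain dots.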
