Does Google BigQuery support the Parquet file format?

I was wondering whether Google BigQuery supports the Parquet file format, or if there are plans to support it.

I know that it currently supports CSV and JSON formats.

+5
4 answers

Update: as of March 1, 2018, support for loading Parquet files is available.

With version 2.0.24 of the BigQuery CLI, there is a --source_format PARQUET option, which is described in the output of bq --help.

If I try to use it in my project, I get an error message. Based on the related BigQuery ticket, Parquet load support currently seems to be invite-only.

% bq load --source_format PARQUET test.test3 data.avro.parquet schema.json
Upload complete.
Waiting on bqjob_r5b8a2b16d964eef7_0000015b0690a06a_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'msgqegcp:bqjob_r5b8a2b16d964eef7_0000015b0690a06a_1': Loading of parquet file format is not enabled

My motivation: a Parquet file is about half the size of the equivalent Avro file. I wanted to try something new and load data efficiently (in that order).
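
For reference, now that the feature is generally available, a Parquet load from Cloud Storage looks roughly like this (a minimal sketch; the dataset, table, and bucket names are placeholders, and no schema file is needed because Parquet files carry their own schema):

# mydataset, mytable, and mybucket are hypothetical names
% bq load --source_format=PARQUET mydataset.mytable gs://mybucket/data.parquet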

+6

BigQuery does not currently support the Parquet file format. However, we are interested in learning more about your use case: are you interested in importing, exporting, or both? How are you going to use it? Understanding the scenarios will help the BigQuery team plan accordingly.

+1

If you want to share a file format between BigQuery and Hadoop, you can use newline-delimited JSON records.

BigQuery supports them for import and export.

Hadoop supports this as well. A web search turns up plenty of recipes for making it work; here's one: Handling JSON using Java MapReduce
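
For illustration, a newline-delimited JSON round trip with the bq CLI might look like the following (a minimal sketch; the dataset, table, bucket, and schema-file names are placeholders):

# load newline-delimited JSON from Cloud Storage, using a local schema file
% bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/data.json ./schema.json
# export the table back out in the same format
% bq extract --destination_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/export.json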

+1

When you are dealing with hundreds of millions of rows and need to move the data to an on-premises Hadoop cluster, exporting from BigQuery as JSON is simply not a feasible option, and Avro is not much better. The only efficient option today is to move gzipped data, which unfortunately cannot be read natively in Hadoop. Parquet is the only efficient format for this use case; we have no other effective option.
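
For comparison, the gzipped export described above can be produced like this (a sketch with placeholder names; the wildcard shards the output, which BigQuery requires for exports over 1 GB):

% bq extract --destination_format=CSV --compression=GZIP mydataset.mytable 'gs://mybucket/export-*.csv.gz'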

0
