I have a data frame with a small number of fields. Some of the fields are huge binary blobs, and the size of a single row is about 50 MB.
I save the data frame in Parquet format and control the row group size with the parquet.block.size parameter.
Spark generates a Parquet file, but I always get at least 100 rows per row group. This is a problem for me, because with ~50 MB rows a row group can reach roughly 5 GB, which does not work well with my application.
parquet.block.size works as expected as long as the block is large enough to hold more than 100 rows.
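For completeness, this is roughly how the parameter can be passed to the Parquet writer; a minimal sketch (the per-write .option() form assumes Spark copies unrecognized write options into the Parquet writer's Hadoop configuration, and df is just a placeholder for the data frame being written, bbox2d_dataframe in the full snippet below):

# Globally, on the SparkContext's Hadoop configuration (the approach used below):
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 8 * 1024 * 1024)

# Per write, assuming the option is forwarded to the Parquet output format:
df.write.option("parquet.block.size", 8 * 1024 * 1024).mode("overwrite").parquet("/tmp/huge/")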
I changed MINIMUM_RECORD_COUNT_FOR_CHECK to 2 in InternalParquetRecordWriter.java, which fixed the problem, but I cannot find any configuration value that would let me set this hard-coded constant.
Is there another/better way to get a row group size smaller than 100 rows?
Here is a snippet of my code:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, BinaryType
import numpy as np

def fake_row(x):
    # Two identical binary fields of ~1.5 MB each, so ~3 MB per row.
    result = bytearray(np.random.randint(0, 127, 3 * 1024 * 1024 // 2, dtype=np.uint8).tobytes())
    return Row(result, result)

spark = SparkSession \
    .builder \
    .appName("bbox2d_dataset_extraction") \
    .config("spark.driver.memory", "12g") \
    .config("spark.executor.memory", "4g") \
    .master("local[5]") \
    .getOrCreate()
sc = spark.sparkContext

# Ask the Parquet writer for 8 MB row groups.
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 8 * 1024 * 1024)

index = sc.parallelize(range(50), 5)
huge_rows = index.map(fake_row)
schema = StructType([StructField('f1', BinaryType(), False),
                     StructField('f2', BinaryType(), False)])

bbox2d_dataframe = spark.createDataFrame(huge_rows, schema).coalesce(1)
bbox2d_dataframe \
    .write.option("compression", "none") \
    .mode('overwrite') \
    .parquet('/tmp/huge/')
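A quick way to see the resulting row group layout is to read the Parquet footer, for example with pyarrow; a minimal sketch (it assumes pyarrow is installed and that the output file matches Spark's usual part-file naming):

import glob
import pyarrow.parquet as pq

# Pick the single part file Spark wrote under /tmp/huge/.
path = glob.glob('/tmp/huge/part-*.parquet')[0]
meta = pq.ParquetFile(path).metadata

print('row groups:', meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(i, 'rows:', rg.num_rows, 'bytes:', rg.total_byte_size)

With the 8 MB block size and ~3 MB rows above, a row group should hold only two or three rows if parquet.block.size were honored.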