I have a data frame with a small number of fields. Some of the fields are huge binary blobs, and the size of a single row is about 50 MB.
I save the data frame in Parquet format and control the row group size with the parquet.block.size parameter.
Spark generates a Parquet file, but I always get at least 100 rows per row group. This is a problem for me, because with ~50 MB rows a row group can reach roughly 5 GB, which does not work well with my application.
parquet.block.size works as expected as long as the block is large enough to hold more than 100 rows.
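For completeness, this is roughly how the parameter can be passed to the Parquet writer; a minimal sketch (the per-write .option() form assumes Spark copies unrecognized write options into the Parquet writer's Hadoop configuration, and df is just a placeholder for the data frame being written, bbox2d_dataframe in the full snippet below):

# Globally, on the SparkContext's Hadoop configuration (the approach used below):
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 8 * 1024 * 1024)

# Per write, assuming the option is forwarded to the Parquet output format:
df.write.option("parquet.block.size", 8 * 1024 * 1024).mode("overwrite").parquet("/tmp/huge/")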
I changed MINIMUM_RECORD_COUNT_FOR_CHECK to 2 in InternalParquetRecordWriter.java, which fixed the problem, but I cannot find any configuration value that would let me set this hard-coded constant.
Is there another/better way to get a row group size smaller than 100 rows?
Here is a snippet of my code:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, BinaryType
import numpy as np

def fake_row(x):
    # Two identical binary fields of ~1.5 MB each, so ~3 MB per row.
    result = bytearray(np.random.randint(0, 127, 3 * 1024 * 1024 // 2, dtype=np.uint8).tobytes())
    return Row(result, result)

spark = SparkSession \
    .builder \
    .appName("bbox2d_dataset_extraction") \
    .config("spark.driver.memory", "12g") \
    .config("spark.executor.memory", "4g") \
    .master("local[5]") \
    .getOrCreate()
sc = spark.sparkContext

# Ask the Parquet writer for 8 MB row groups.
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 8 * 1024 * 1024)

index = sc.parallelize(range(50), 5)
huge_rows = index.map(fake_row)
schema = StructType([StructField('f1', BinaryType(), False),
                     StructField('f2', BinaryType(), False)])

bbox2d_dataframe = spark.createDataFrame(huge_rows, schema).coalesce(1)
bbox2d_dataframe \
    .write.option("compression", "none") \
    .mode('overwrite') \
    .parquet('/tmp/huge/')
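A quick way to see the resulting row group layout is to read the Parquet footer, for example with pyarrow; a minimal sketch (it assumes pyarrow is installed and that the output file matches Spark's usual part-file naming):

import glob
import pyarrow.parquet as pq

# Pick the single part file Spark wrote under /tmp/huge/.
path = glob.glob('/tmp/huge/part-*.parquet')[0]
meta = pq.ParquetFile(path).metadata

print('row groups:', meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(i, 'rows:', rg.num_rows, 'bytes:', rg.total_byte_size)

With the 8 MB block size and ~3 MB rows above, a row group should hold only two or three rows if parquet.block.size were honored.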