I wrote a piece of code that reads several parquet files and caches them for later use. The code itself is quite simple:
val data = SparkStartup.sqlContext.read.parquet(...)
data.setName(...).persist(StorageLevel.MEMORY_AND_DISK_SER).collect()
map += data
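In full, the loop looks roughly like this. This is a sketch, not my exact code: the file paths, the RDD name, and the collection are placeholders, and since setName lives on the underlying RDD rather than the DataFrame, the sketch calls it via .rdd (SparkStartup is my wrapper object, sketched further below):

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import scala.collection.mutable.ArrayBuffer

// Sketch of the caching loop; paths and names are placeholders.
val map = ArrayBuffer[DataFrame]()
for (path <- Seq("/data/file1.parquet", "/data/file2.parquet")) {
  val data = SparkStartup.sqlContext.read.parquet(path)
  data.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized cache, spills to disk when memory fills
  data.rdd.setName(path)                         // label shown in the Storage tab of the Spark UI
  data.collect()                                 // materializes the cache, and copies every row to the driver
  map += data
}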
The parquet data totals only about 11 GB. I configure my application as follows:
val sparkConfig = new SparkConf().setAppName(...).setMaster("local[128]")
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConfig.set("spark.kryoserializer.buffer.max", "512m");
sparkConfig.set("spark.kryoserializer.buffer", "256");
sparkConfig.set("spark.driver.maxResultSize", "0");
sparkConfig.set("spark.driver.memory", "9g");
I thought that with MEMORY_AND_DISK_SER, Spark would spill to disk if too much memory was used. However, I get `java.lang.OutOfMemoryError: Java heap space` errors:
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:105)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:81)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
Why is this? I start my application with `-Xmx9g -Dspark.executor.memory=9g -Dspark.executor.cores=3`. For the files that are read before everything crashes, I can see in the Spark UI that the parquet data takes up about 9x its on-disk size once read into memory.