Executor OutOfMemoryExceptions despite small data size and available memory

For a simple case class

  case class Rating(user: Int, item: Int, rating: Double)

and two RDD[Rating]s, we get an OOME with only about 7 million records total across the two RDDs.

We set up a single executor with 30.4 GB of RAM. That is a huge amount of memory per Rating entry, and I cannot justify it: the underlying primitives for the Ints and the Double are at most 8 bytes each on a 64-bit platform. On top of that there is overhead for the Java object representations and for the case class itself, but even so we should still be at <200 bytes per Rating entry in the JVM.

Given 30.4 GB of RAM and ~7 million objects, the apparent memory usage would exceed 4 KB per object. That does not add up.
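As a sanity check on that estimate, the on-heap footprint can be measured with Spark's SizeEstimator (a developer API that is public in recent Spark versions); this is only a sketch, and the exact numbers depend on the JVM and Spark version:

    import org.apache.spark.util.SizeEstimator

    object RatingFootprint {
      case class Rating(user: Int, item: Int, rating: Double)

      def main(args: Array[String]): Unit = {
        // Estimated deep size of a single boxed Rating on this JVM.
        val one = SizeEstimator.estimate(Rating(1, 2, 3.0))

        // Estimated size of a million Ratings held in an Array, roughly what a
        // deserialized RDD partition of this data looks like on the heap.
        val perRecord = SizeEstimator.estimate(Array.fill(1000000)(Rating(1, 2, 3.0))) / 1000000

        println(s"one Rating: $one bytes, array average: $perRecord bytes per record")
      }
    }

On a typical 64-bit JVM this should come out to a few tens of bytes per record, nowhere near 4 KB.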

Here is the cluster and job information, showing 30.4 GB for the single executor:


Note that we also enabled Kryo serialization by registering the Rating class with the KryoSerializer, and set

 spark.rdd.compress=true

This did not affect the OOME.
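For reference, this is roughly how those settings are wired up in the job (a sketch; the object and app names are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object RatingJobSetup {
      case class Rating(user: Int, item: Int, rating: Double)

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("rating-oome-repro") // placeholder name
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .registerKryoClasses(Array(classOf[Rating])) // register Rating with Kryo
          .set("spark.rdd.compress", "true")           // the setting shown above
        val sc = new SparkContext(conf)
        // ... build the two RDD[Rating]s and run the job here ...
        sc.stop()
      }
    }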

Is the in-memory representation of an RDD really this much more bloated than the raw data? Or is the problem that most of the executor RAM is not being used for the data at all, and the OOME happens for some other reason?

Here is the OOME output, which occurs just a few seconds into the job:

[Dec 06 22:37:32] Generated an implicit feedback dataset with 4501305 ratings for training and 2247105 for test.
Generated dataset in 2644ms
[Stage 0:>                                                          (0 + 1) / 2]Exception in thread "dispatcher-event-loop-5" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
    at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458)
    at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
    at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
    at org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$launchTasks$1.apply(CoarseGrainedSchedulerBackend.scala:226)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$launchTasks$1.apply(CoarseGrainedSchedulerBackend.scala:225)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.launchTasks(CoarseGrainedSchedulerBackend.scala:225)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:196)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)

Note: if we use slightly less data - for example, an RDD with 5 million Ratings - the job runs relatively quickly (<40 seconds) and succeeds.

So we are not sure what is effectively limiting the Spark worker to such low memory thresholds.


Note that, judging by the stack trace, the OOME is not happening on an executor - it is happening on the DRIVER.

Are you calling collect, take, groupBy, etc., or just count? With count, only a number comes back to the driver; the other actions can pull data into driver memory.
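To illustrate the difference (a generic sketch, not code from this thread): count only sends a Long back to the driver, while collect copies the entire RDD into driver memory:

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverVsExecutorMemory {
      case class Rating(user: Int, item: Int, rating: Double)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("driver-vs-executor").setMaster("local[2]"))
        val ratings = sc.parallelize(1 to 1000000).map(i => Rating(i, i % 1000, 1.0))

        // count: the data stays partitioned on the executors; only a Long returns.
        println(ratings.count())

        // collect: every record is copied into the driver heap - calls like this
        // can exhaust driver memory long before the executors run out.
        val onDriver: Array[Rating] = ratings.collect()
        println(onDriver.length)

        sc.stop()
      }
    }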


Source: https://habr.com/ru/post/1663134/

