In order for Spark to distribute a given operation, the function used in the operation must be serialized. Before serialization, these functions go through a complex process implemented by the ClosureCleaner.
The goal is to "disconnect" closures from their context in order to reduce the size of the object graph that needs to be serialized, and to reduce the risk of serialization problems along the way. In other words, it makes sure that only the code needed to execute the function is serialized and sent "to the other side" (the executors) to be deserialized and executed.
During this process the closure is also checked for serializability, so that serialization problems are detected proactively at run time, when the job is defined on the driver, rather than when tasks reach the executors (SparkContext#clean).
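As a rough sketch of what that eager check amounts to (the helper name ensureSerializable below is invented for illustration; the real logic lives in ClosureCleaner and uses Spark's JavaSerializer), one could picture something like:

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Illustrative helper: try to serialize the closure up front, so that a
// problem surfaces immediately instead of when tasks are shipped.
def ensureSerializable(closure: AnyRef): Unit =
  try {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(closure) // throws if anything reachable from the closure is not serializable
    oos.close()
  } catch {
    case e: NotSerializableException =>
      throw new IllegalArgumentException("Task not serializable", e)
  }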
This code is dense and complex, so it's hard to find the right code path leading to this case.
Intuitively, what happens when the ClosureCleaner finds:
val result = rdd.map{elem => funcs.func_1(elem) }
It evaluates the internal members of the closure as objects that can be recreated and finds no further references to the enclosing scope, so the cleaned closure contains only {elem => funcs.func_1(elem)}, which can be serialized by the JavaSerializer.
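For instance, assuming funcs is a standalone, stateless object marked Serializable (the original does not say what funcs actually is, so this is only an assumption), the whole pipeline works:

// Assumption: funcs is a top-level object with no state, marked Serializable.
object funcs extends Serializable {
  def func_1(s: String): String = s.toUpperCase
}

// sc is an existing SparkContext
val rdd = sc.parallelize(Seq("a", "b", "c"))
val result = rdd.map { elem => funcs.func_1(elem) } // no $outer reference, serializes fine
result.collect() // Array("A", "B", "C")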
Instead, when the ClosureCleaner evaluates:
val handler = funcs
val result = rdd.map(elem => { handler.func_1(elem) })
it finds that the closure holds a reference to $outer (handler), so it inspects the outer scope and adds that variable, together with the instance it points to, to the cleaned closure. We could imagine the resulting cleaned closure to look something like this (for illustrative purposes only):
{ elem =>
  val handler = funcs
  handler.func_1(elem)
}
When this closure is checked for serializability, it cannot be serialized: by JVM serialization rules, an object can be serialized only if all of its members are recursively serializable. In this case handler refers to a non-serializable object, and the validation fails.
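A minimal reproduction, assuming handler is a field of a non-serializable enclosing class (the class names Funcs and Job are invented for the example):

class Funcs { // does NOT extend Serializable
  def func_1(s: String): String = s.toUpperCase
}

class Job(sc: org.apache.spark.SparkContext, funcs: Funcs) {
  val handler = funcs // member of the enclosing, non-serializable instance

  def run(): Array[String] = {
    val rdd = sc.parallelize(Seq("a", "b"))
    // The lambda reaches handler through $outer (the Job instance); since
    // neither Job nor Funcs is serializable, SparkContext#clean rejects the
    // closure with "Task not serializable".
    rdd.map(elem => handler.func_1(elem)).collect()
  }
}

In this sketch, making Funcs serializable (or creating it inside the closure) would be enough to make the check pass.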