I am trying to understand how task serialization works in Spark, and I'm a little confused by some of the mixed results that I get in the test I wrote.
I have some test code (simplified for mail) that does the following through more than node:
object TestJob {
def run(): Unit = {
val rdd = ...
val helperObject = new Helper() // Helper does NOT impl Serializable and is a vanilla class
rdd.map(element => {
helperObject.transform(element)
}).collect()
}
}
When I execute run()
, the task starts with the exception "task not serializable", as expected, since helperObject
it is not serializable. HOWEVER, when I change this a bit, for example:
trait HelperComponent {
val helperObject = new Helper()
}
object TestJob extends HelperComponent {
def run(): Unit = {
val rdd = ...
rdd.map(element => {
helperObject.transform(element)
}).collect()
}
}
The task is being performed for some reason. Can someone help me figure out why this might be? What exactly is Spark serialized and sent to the workers above in each case?
I am using Spark version 2.1.1.
Thank!