The source of the problem is the mutable data structure that you use to populate the RDD. When you call sc.parallelize(list), it does not copy the state of the ArrayList — it only keeps a reference to it. Since you call clear on the list before the data is actually evaluated, there is no data left by the time the RDD is computed.
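The underlying issue can be reproduced without Spark at all. A minimal sketch (plain Scala; the `captured` reference is a stand-in for what parallelize is assumed to hold onto):

```scala
import scala.collection.mutable.ArrayBuffer

object ClearedBufferDemo {
  // Returns the size that a captured reference sees after the source is cleared.
  def sizeAfterClear(): Int = {
    val list = ArrayBuffer(1, 2, 3)
    // Like sc.parallelize, this keeps a reference to the buffer, not a copy.
    val captured: scala.collection.Seq[Int] = list
    list.clear()
    captured.size // the data is gone by the time we "evaluate" it
  }
}
```

Calling `ClearedBufferDemo.sizeAfterClear()` returns 0: the reference is still valid, but it points at a now-empty collection, which is exactly what the lazily evaluated RDD observes.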
Honestly, I don't know why this behavior changes when you call the count method. Since the RDD is not cached, my hunch is that it comes down to Spark or JVM internals, but I won't try to guess what is actually happening. Maybe someone smarter can work out the exact reason for this behavior.
Just to illustrate what happens:
```scala
val arr = Array(1, 2, 3)
val rdd = sc.parallelize(arr)
(0 until 3).foreach(arr(_) = 99)
val tmp = sc.parallelize(arr)
tmp.union(rdd).collect
```
vs.
```scala
val arr = Array(1, 2, 3)
val rdd = sc.parallelize(arr)
rdd.count()
```
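Whatever the internals do, the practical workaround is to hand Spark an immutable snapshot rather than the live mutable collection — e.g. `sc.parallelize(list.toList)` instead of `sc.parallelize(list)`. A sketch of the same idea without Spark:

```scala
import scala.collection.mutable.ArrayBuffer

object DefensiveCopyDemo {
  // Returns the size the snapshot sees after the source buffer is cleared.
  def sizeAfterClearWithCopy(): Int = {
    val list = ArrayBuffer(1, 2, 3)
    // Defensive copy: toList materializes an immutable snapshot,
    // so later mutations of `list` cannot affect it.
    val snapshot: List[Int] = list.toList
    list.clear()
    snapshot.size // still 3: clear() only emptied the original buffer
  }
}
```

With the copy in place, clearing the original collection no longer empties the data the (lazily evaluated) consumer will see.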