I have an RDD of file names, i.e. RDD[String]. I get it by parallelizing a list of file names (files stored in HDFS).
Now I map over this RDD; my code opens a Hadoop stream with FileSystem.open(path) and then processes it.
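A minimal sketch of what I mean (the file names and the processing step are placeholders, not my actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object ReadFilesByName {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-files-by-name"))

    // Hypothetical list of HDFS file names
    val fileNames = Seq("hdfs:///data/file1.txt", "hdfs:///data/file2.txt")
    val rdd = sc.parallelize(fileNames) // RDD[String] of file names

    val results = rdd.map { name =>
      // Open the file directly through the Hadoop FileSystem API
      val path = new Path(name)
      val fs = path.getFileSystem(new Configuration())
      val in = fs.open(path)
      try {
        // Placeholder "processing": count the lines in the stream
        scala.io.Source.fromInputStream(in).getLines().size
      } finally {
        in.close()
      }
    }

    results.collect().foreach(println)
    sc.stop()
  }
}
```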
When I run the job, the Spark UI (Stages tab) shows "Locality Level" = PROCESS_LOCAL for every task. I don't see how Spark could achieve data locality this way, given that the data is spread across a cluster of 4 data nodes. How is this possible?