Updated information

I have an RDD of file names, i.e. RDD[String]. I get it by parallelizing a list of file names (of files inside HDFS).

Now I map over this RDD, and my code opens a Hadoop stream using FileSystem.open(path) and then processes the file.
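A minimal sketch of that setup, assuming a SparkContext sc is already in scope; the HDFS paths and the processing step are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Hypothetical example: an RDD of HDFS file names, each file opened
// manually inside the map.
val fileNames: RDD[String] = sc.parallelize(Seq(
  "hdfs:///data/part-0000",
  "hdfs:///data/part-0001"))

val results = fileNames.map { name =>
  val path = new Path(name)
  // Each task builds its own FileSystem handle from the Hadoop config.
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  try {
    // Placeholder processing: count the bytes in the stream.
    Iterator.continually(in.read()).takeWhile(_ != -1).size
  } finally {
    in.close()
  }
}
```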

When I run the job and look at the Spark UI / Stages tab, I see "Locality Level" = "PROCESS_LOCAL" for all tasks. I don't see how Spark could achieve data locality given the way I run the tasks (on a cluster of 4 data nodes), so how is this possible?

+4
2 answers

Data locality is one of the features of Spark that increases its processing speed. Data locality is covered in the "Data Locality" section of the Spark tuning guide. When you write sc.textFile("path"), the locality level at that point is determined by where the data behind that path lives, but after that Spark tries to reach the PROCESS_LOCAL locality level, optimizing processing speed by running each task where its data already is (locally).
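A quick sketch of the contrast this answer is drawing (the path is a placeholder, sc an existing SparkContext):

```scala
// sc.textFile exposes the HDFS block locations to the scheduler, so
// tasks can be scheduled NODE_LOCAL to the blocks they read.
val lines = sc.textFile("hdfs:///data/part-0000")

// Parallelizing plain file-name strings gives the scheduler nothing to
// place: the RDD's "data" is just the strings, which live wherever the
// task runs, hence Locality Level = PROCESS_LOCAL in the UI.
val names = sc.parallelize(Seq("hdfs:///data/part-0000"))
```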

+2

When FileSystem.open(path) is executed inside a Spark task, the file content is loaded into a local variable in the same JVM process that prepares the RDD partition(s). Therefore the data locality for this RDD is always PROCESS_LOCAL.

- as vanekjar already commented on the question
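One way to see this for yourself (a hypothetical check, assuming a SparkContext sc): a parallelized RDD reports no preferred locations, so the scheduler counts whichever executor runs the task as "local".

```scala
val names = sc.parallelize(Seq("hdfs:///data/a", "hdfs:///data/b"))

// preferredLocations is empty for every partition of a parallelized
// collection, so no host is "closer" to the data than any other.
names.partitions.foreach { p =>
  println(s"partition ${p.index}: ${names.preferredLocations(p)}")
}
```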


Additional information about data locality in Spark:

There are several locality levels, based on the data's current location. In order from closest to farthest:

  • PROCESS_LOCAL: data is in the same JVM as the running code. This is the best locality possible.
  • NODE_LOCAL: data is on the same node. Examples might be data in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes.
  • NO_PREF: data is accessed equally quickly from anywhere and has no locality preference.
  • RACK_LOCAL: data is on the same rack of servers. The data is on a different server on the same rack, so it needs to be sent over the network, typically through a single switch.
  • ANY: data is elsewhere on the network and not on the same rack.

Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark falls back to lower locality levels.
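The fallback timing is configurable through the spark.locality.wait keys documented in the tuning guide. A short sketch; the values below are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

// How long the scheduler waits at each locality level before giving up
// and launching the task at the next level out.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")       // default wait applied per level
  .set("spark.locality.wait.node", "6s")  // override for NODE_LOCAL
  .set("spark.locality.wait.rack", "1s")  // override for RACK_LOCAL
```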

+2

Source: https://habr.com/ru/post/988911/

