Updated information

I have an RDD of file names, i.e. RDD[String]. I get it by parallelizing a list of file names (of files inside HDFS).

Now I map over this RDD, and my code opens a Hadoop stream using FileSystem.open(path) and then processes the file.
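A minimal sketch of that setup, assuming a SparkContext sc is already in scope; the HDFS paths and the processing step are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Hypothetical example: an RDD of HDFS file names, each file opened
// manually inside the map.
val fileNames: RDD[String] = sc.parallelize(Seq(
  "hdfs:///data/part-0000",
  "hdfs:///data/part-0001"))

val results = fileNames.map { name =>
  val path = new Path(name)
  // Each task builds its own FileSystem handle from the Hadoop config.
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  try {
    // Placeholder processing: count the bytes in the stream.
    Iterator.continually(in.read()).takeWhile(_ != -1).size
  } finally {
    in.close()
  }
}
```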

When I run the job and look at the Spark UI / Stages tab, I see "Locality Level" = "PROCESS_LOCAL" for all tasks. I don't see how Spark could achieve data locality given the way I run the tasks (on a cluster of 4 data nodes), so how is this possible?

+4
2 answers

Data locality is one of the features of Spark that increases its processing speed. Data locality is covered in the "Data Locality" section of the Spark tuning guide. When you write sc.textFile("path"), the locality level at that point is determined by where the data behind that path lives, but after that Spark tries to reach the PROCESS_LOCAL locality level, optimizing processing speed by running each task where its data already is (locally).
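A quick sketch of the contrast this answer is drawing (the path is a placeholder, sc an existing SparkContext):

```scala
// sc.textFile exposes the HDFS block locations to the scheduler, so
// tasks can be scheduled NODE_LOCAL to the blocks they read.
val lines = sc.textFile("hdfs:///data/part-0000")

// Parallelizing plain file-name strings gives the scheduler nothing to
// place: the RDD's "data" is just the strings, which live wherever the
// task runs, hence Locality Level = PROCESS_LOCAL in the UI.
val names = sc.parallelize(Seq("hdfs:///data/part-0000"))
```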

+2

When FileSystem.open(path) is executed inside a Spark task, the file content is loaded into a local variable in the same JVM process that prepares the RDD partition(s). Therefore the data locality for this RDD is always PROCESS_LOCAL.

- as vanekjar already commented on the question
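One way to see this for yourself (a hypothetical check, assuming a SparkContext sc): a parallelized RDD reports no preferred locations, so the scheduler counts whichever executor runs the task as "local".

```scala
val names = sc.parallelize(Seq("hdfs:///data/a", "hdfs:///data/b"))

// preferredLocations is empty for every partition of a parallelized
// collection, so no host is "closer" to the data than any other.
names.partitions.foreach { p =>
  println(s"partition ${p.index}: ${names.preferredLocations(p)}")
}
```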


Additional information about data locality in Spark:

There are several locality levels, based on the data's current location. In order from closest to farthest:

  • PROCESS_LOCAL: data is in the same JVM as the running code. This is the best locality possible.
  • NODE_LOCAL: data is on the same node. Examples might be data in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes.
  • NO_PREF: data is accessed equally quickly from anywhere and has no locality preference.
  • RACK_LOCAL: data is on the same rack of servers. The data is on a different server on the same rack, so it needs to be sent over the network, typically through a single switch.
  • ANY: data is elsewhere on the network and not on the same rack.

Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark falls back to lower locality levels.
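The fallback timing is configurable through the spark.locality.wait keys documented in the tuning guide. A short sketch; the values below are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

// How long the scheduler waits at each locality level before giving up
// and launching the task at the next level out.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")       // default wait applied per level
  .set("spark.locality.wait.node", "6s")  // override for NODE_LOCAL
  .set("spark.locality.wait.rack", "1s")  // override for RACK_LOCAL
```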

+2

Source: https://habr.com/ru/post/988911/

