It seems unnecessary to use Spark directly for this ... If this data is "collected" on the driver anyway, why not use the HDFS API? Hadoop typically comes along with Spark. Here is an example:
import java.net.URI
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._

val fileSpec = "/data/Invoices/20171123/21"
val conf = new Configuration()
// Connect to the cluster's namenode and list the directory's contents
val fs = FileSystem.get(new URI("hdfs://nameNodeEneteredHere"), conf)
val path = new Path(fileSpec)
// if (fs.exists(path) && fs.isDirectory(path)) ...
val fileList = fs.listStatus(path)
Then println(fileList(0)) prints the first element as an org.apache.hadoop.fs.FileStatus, formatted roughly like this:
FileStatus {
  path=hdfs://nameNodeEneteredHere/Invoices-0001.avro;
  isDirectory=false;
  length=29665563;
  replication=3;
  blocksize=134217728;
  modification_time=1511810355666;
  access_time=1511838291440;
  owner=codeaperature;
  group=supergroup;
  permission=rw-r--r--;
  isSymlink=false
}
Here, fileList(0).getPath gives hdfs://nameNodeEneteredHere/Invoices-0001.avro.
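If the goal is simply a list of file names on the driver, a minimal sketch building on the fileList above (the filePaths name and the isFile filter are my own additions, not part of the original code) might be:

// Keep only plain files and collect their full HDFS paths as strings
val filePaths: Array[String] = fileList
  .filter(_.isFile)
  .map(_.getPath.toString)

// Other FileStatus fields are also available, e.g. size and modification time
fileList.foreach { status =>
  println(s"${status.getPath.getName}  ${status.getLen} bytes  modified ${status.getModificationTime}")
}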
I assume this means that the file listing goes primarily through the HDFS namenode, and not through each executor. TL;DR: I presume Spark will most likely poll the namenode to build the RDD anyway. If the underlying Spark call queries the namenode to manage the RDD, perhaps the above is an effective solution. Nevertheless, informative comments suggesting either direction are welcome.
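For what it's worth, if a SparkSession is already in scope, one variation (assuming a SparkSession named spark; this is not from the original answer) is to reuse Spark's own Hadoop configuration instead of building a new one, so the namenode URI comes from core-site.xml rather than being hard-coded. The listing still happens on the driver against the namenode either way:

import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration Spark already carries; the default
// filesystem (the namenode) is resolved from the cluster's core-site.xml
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val fileNames = fs
  .listStatus(new Path("/data/Invoices/20171123/21"))
  .map(_.getPath.getName)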