I have deployed an Azure HDInsight cluster with rwx permissions on all directories in the Azure Data Lake Store that also serves as the cluster's storage account. On the head node, I can load image data from ADLS with a command such as:
my_rdd = sc.binaryFiles('adl://{}.azuredatalakestore.net/my_file.png')
The workers, however, do not have access to the SparkContext, so they cannot call binaryFiles(). I can use the azure-datalake-store Python SDK to download a file instead, but that appears to be much slower; I assume it takes no advantage of the association between the cluster and ADLS.
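Roughly, the worker-side SDK download I tried looks like the sketch below; the store name and service-principal credentials are placeholders, and the import is deferred into the function body so it executes on the worker that calls it rather than on the driver:

```python
def fetch_from_adls(path, store_name, tenant_id, client_id, client_secret):
    """Download one file from ADLS and return its raw bytes.

    All credential arguments are placeholders. The import happens inside
    the function so it runs on the worker, not on the driver.
    """
    from azure.datalake.store import core, lib

    # Authenticate with a service principal and open the store.
    token = lib.auth(tenant_id=tenant_id,
                     client_id=client_id,
                     client_secret=client_secret)
    adls = core.AzureDLFileSystem(token, store_name=store_name)

    # Read the whole file as bytes.
    with adls.open(path, 'rb') as f:
        return f.read()
```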
Is there a faster way to download files from the associated ADLS onto the workers?
Further context, if necessary:
I am using PySpark to apply a deep learning model to a large set of images. Because the model is time-consuming, my ideal workflow would be as follows:
- Send each worker a partial list of image URIs to process (by applying mapPartitions() to an RDD containing the complete list)
- Have each worker load the data for one image at a time and run it through the model
- Return the model's results for the set of images
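The loop I have in mind would look something like the sketch below, where `load_bytes` and `score_image` are hypothetical stand-ins for the worker-side ADLS download and the deep learning model:

```python
def process_partition(uris, load_bytes, score_image):
    """Score a partition of image URIs one image at a time.

    `load_bytes` and `score_image` are hypothetical stand-ins for the
    worker-side ADLS download and the deep learning model.
    """
    for uri in uris:
        data = load_bytes(uri)        # load exactly one image into memory
        yield uri, score_image(data)  # run the model on that image

# On the driver, something like:
# uri_rdd = sc.parallelize(all_image_uris, numSlices=n_workers)
# results = uri_rdd.mapPartitions(
#     lambda it: process_partition(it, load_bytes, score_image)).collect()
```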
Since I don't know how to load images onto the workers efficiently, my best option at the moment is to partition an RDD containing the image byte data, which (I assume) is memory-inefficient and creates a bottleneck because the head node performs all the data loading.