I have deployed an Azure HDInsight cluster with rwx permissions on all directories in the Azure Data Lake Store that also serves as the cluster's storage account. On the head node, I can load image data from ADLS with a command such as:
my_rdd = sc.binaryFiles('adl://{}.azuredatalakestore.net/my_file.png')
The workers, however, do not have access to the SparkContext, so they cannot call binaryFiles(). I can use the azure-datalake-store Python SDK to download a file instead, but that appears to be much slower; I assume it takes no advantage of the association between the cluster and ADLS.
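Roughly, the worker-side SDK download I tried looks like the sketch below; the store name and service-principal credentials are placeholders, and the import is deferred into the function body so it executes on the worker that calls it rather than on the driver:

```python
def fetch_from_adls(path, store_name, tenant_id, client_id, client_secret):
    """Download one file from ADLS and return its raw bytes.

    All credential arguments are placeholders. The import happens inside
    the function so it runs on the worker, not on the driver.
    """
    from azure.datalake.store import core, lib

    # Authenticate with a service principal and open the store.
    token = lib.auth(tenant_id=tenant_id,
                     client_id=client_id,
                     client_secret=client_secret)
    adls = core.AzureDLFileSystem(token, store_name=store_name)

    # Read the whole file as bytes.
    with adls.open(path, 'rb') as f:
        return f.read()
```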
Is there a faster way to download files from the associated ADLS onto the workers?
Further context, if necessary:
I am using PySpark to apply a deep learning model to a large set of images. Because the model is time-consuming, my ideal workflow would be as follows:
- Send each worker a partial list of image URIs to process (by applying mapPartitions() to an RDD containing the complete list)
- Have each worker load the data for one image at a time and run it through the model
- Return the model's results for the set of images
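The loop I have in mind would look something like the sketch below, where `load_bytes` and `score_image` are hypothetical stand-ins for the worker-side ADLS download and the deep learning model:

```python
def process_partition(uris, load_bytes, score_image):
    """Score a partition of image URIs one image at a time.

    `load_bytes` and `score_image` are hypothetical stand-ins for the
    worker-side ADLS download and the deep learning model.
    """
    for uri in uris:
        data = load_bytes(uri)        # load exactly one image into memory
        yield uri, score_image(data)  # run the model on that image

# On the driver, something like:
# uri_rdd = sc.parallelize(all_image_uris, numSlices=n_workers)
# results = uri_rdd.mapPartitions(
#     lambda it: process_partition(it, load_bytes, score_image)).collect()
```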
Since I don't know how to load images onto the workers efficiently, my best option at the moment is to partition an RDD containing the image byte data, which (I assume) is memory-inefficient and creates a bottleneck because the head node performs all the data loading.