Using an external library in a PySpark job on a Spark cluster created with Google Dataproc

I have a Spark cluster created through Google Dataproc. I want to be able to use the spark-csv library from Databricks (see https://github.com/databricks/spark-csv ). So I first tested it like this:

I started an SSH session with the master node of my cluster, then I entered:

pyspark --packages com.databricks:spark-csv_2.11:1.2.0 

This launched the pyspark shell, in which I entered:

 df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs://xxxx/foo.csv')
 df.show()

And it worked.

The next step is to run this job from my main machine using the command:

 gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py 

But this does not work and I get an error message. I think it is because I did not pass --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried about 10 different ways of passing it and could not get it to work.
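For reference, a minimal sketch of what my_job.py might look like (essentially the shell session above turned into a script; gs://xxxx is a placeholder path, and the spark-csv package is assumed to be supplied at submit time):

 # my_job.py -- minimal sketch; relies on the spark-csv package being made available at submit time.
 from pyspark import SparkContext
 from pyspark.sql import SQLContext

 sc = SparkContext()
 sqlContext = SQLContext(sc)

 # Read the CSV using the Databricks spark-csv data source.
 df = sqlContext.read.format('com.databricks.spark.csv') \
     .options(header='true', inferschema='true') \
     .load('gs://xxxx/foo.csv')
 df.show()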

My questions are:

  • was a Databricks csv library actually installed after I ran pyspark --packages com.databricks:spark-csv_2.11:1.2.0 ?
  • Is it possible to write a line in my job.py to import it?
  • or what parameters should I pass to the gcloud command to import or install it?
2 answers

Short answer

There is a quirk in argument ordering, where --packages is not accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from the Dataproc CLI:

 gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
     --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py

Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.

Long answer

So this is actually a different issue than the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; because Dataproc does not explicitly recognize --packages as a special spark-submit-level flag, it tries to pass it after the application arguments, so spark-submit lets --packages fall through as an application argument rather than properly parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:

 # Doesn't work if job.py depends on that package.
 spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0

But switching the order of the arguments does work, though in the pyspark case both orderings work:

 # Works with dependencies on that package.
 spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
 pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
 pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py

So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means this is not actually a 100% compatible migration. This might be something to follow up on the Spark side.

Anyway, fortunately there is a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can simply do the following:

 gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
     --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py

Please note that --properties must come before my_job.py, otherwise it will be sent as an application argument rather than as a configuration flag. Hope this works for you! Note that the equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
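Since --packages is just an alias for the spark.jars.packages property, passing the property explicitly from an SSH session should, in principle, achieve the same thing; a sketch of the presumed equivalence (not verified here):

 # Presumed equivalent from an SSH session: set the property instead of using --packages.
 spark-submit --conf spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 job.py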


In addition to @Dennis's answer:

Note that if you need to load multiple external packages, you need to specify a custom escape character, for example:

 --properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1

Pay attention to the ^#^ right before the list of packages. See gcloud topic escaping for more details.
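Put together, a full submit command with multiple packages might look like the following sketch (the cluster name placeholder and package versions are illustrative):

 # The ^#^ prefix tells gcloud to split --properties on '#' instead of ',',
 # so the commas inside the package list are preserved.
 gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
     --properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1 \
     my_job.py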



