Running a custom Java class in PySpark

I am trying to run my own HDFS reading class in PySpark. This class is written in Java, and I need to access it from PySpark, either from the shell or using spark-submit.

In PySpark, I retrieve the JavaGateway from the SparkContext (sc._gateway).

Say I have a class:

 package org.foo.module;

 public class Foo {
     public int fooMethod() {
         return 1;
     }
 }

I tried packaging it in a jar, passing it with --jars to pyspark, and then running:

 from py4j.java_gateway import java_import

 jvm = sc._gateway.jvm
 java_import(jvm, "org.foo.module.*")
 foo = jvm.org.foo.module.Foo()

But I get the error:

 Py4JError: Trying to call a package. 

Can anyone help with this? Thanks.

3 answers

The problem you describe usually indicates that org.foo.module is not on the driver CLASSPATH. One possible solution is to use spark.driver.extraClassPath to add the jar file. It can be set in conf/spark-defaults.conf, for example, or passed as a command-line parameter.
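For example, assuming the jar is called foo-module.jar (the file name here is only illustrative), either of the following should put it on the driver classpath:

 # In conf/spark-defaults.conf:
 spark.driver.extraClassPath /path/to/foo-module.jar

 # Or as a command-line parameter when starting the shell:
 pyspark --driver-class-path /path/to/foo-module.jar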

On a side note:

  • If the class you are using is a custom input format, there should be no need to use the Py4j gateway at all. You can simply use the SparkContext.hadoop* / SparkContext.newAPIHadoop* methods (a sketch follows this list).

  • Using java_import(jvm, "org.foo.module.*") seems like a bad idea. Generally speaking, you should avoid unnecessary imports on the JVM. It is not public for a reason, and you really don't want to mess with it, especially when you access the class in a way that makes the import completely redundant. So drop java_import and stick with jvm.org.foo.module.Foo().
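If you do go the input format route, a minimal sketch might look like the following. Note that FooInputFormat and the key/value classes are purely hypothetical placeholders for whatever your Java code actually declares:

 # Hypothetical sketch: reading HDFS data through a custom InputFormat,
 # no Py4j gateway involved. The class names are placeholders.
 rdd = sc.newAPIHadoopFile(
     "hdfs:///path/to/data",
     inputFormatClass="org.foo.module.FooInputFormat",
     keyClass="org.apache.hadoop.io.LongWritable",
     valueClass="org.apache.hadoop.io.Text",
 )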


In PySpark, try the following:

 from py4j.java_gateway import java_import

 java_import(sc._gateway.jvm, "org.foo.module.Foo")
 func = sc._gateway.jvm.Foo()
 func.fooMethod()

Make sure you compile the Java code into a jar, and submit the Spark job like this:

 spark-submit --driver-class-path "name_of_your_jar_file.jar" --jars "name_of_your_jar_file.jar" name_of_your_python_file.py 

Instead of --jars, you should use --packages to pull the package into your spark-submit invocation.
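Keep in mind that --packages expects Maven coordinates rather than a path to a local jar, so this only helps if the module is published to a repository. The coordinates below are purely illustrative:

 spark-submit --packages org.foo:foo-module:1.0.0 name_of_your_python_file.py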


Source: https://habr.com/ru/post/1244660/

