Py4J has more overhead than Jython and JPype

Question


After searching for a way to run Java code from a Django (Python) application, I concluded that Py4J is the best option for me. I tried Jython, JPype, and a Python subprocess, and each one has certain limitations:

Jython: my application runs in Python, so Jython is not an option.
JPype: it is unreliable. You can start the JVM only once; after that it cannot be started again.
Python subprocess: it is not possible to pass a Java object between Python and Java, because this is just a regular console call.

On the Py4J website it says:

In terms of performance, Py4J has more overhead than previous solutions (Jython and JPype) because it relies on sockets, but if performance is critical to your application, accessing Java objects from Python programs may not be the best idea.

Performance is very important in my application because I work with the Mahout machine learning framework. My question is: will Mahout itself run slower because of the Py4J gateway server, or does this overhead only mean that invoking Java methods from Python is slower? (In the latter case, Mahout's performance would not be a problem, and I can use Py4J.)


java python mahout py4j

HIP_HOP Aug 28 '13 at 10:03


4 answers

bastian · Answer 1 · 2014-01-13T18:26:42+0000

I do not know Mahout. But think about it: with both JPype and Py4J you will pay a performance cost when converting types from Java to Python and vice versa. Try to minimize calls between the languages. One alternative might be to write a thin Java wrapper that condenses many Java calls into a single Python-to-Java call.
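As a rough illustration of why fewer cross-language calls help, here is a minimal sketch in pure Python. Everything in it is hypothetical: `FakeBridge` is only a stand-in that counts round trips and is not the Py4J or JPype API, and `score` / `score_all` are placeholders for work that would live on the Java side.

```python
class FakeBridge:
    """Hypothetical stand-in for a Python-to-Java bridge (not the Py4J API).

    Each call() represents one cross-language round trip.
    """

    def __init__(self):
        self.round_trips = 0

    def call(self, func, *args):
        # Count the round trip, then run the "remote" function.
        self.round_trips += 1
        return func(*args)


def score(x):
    # Placeholder for a single Java method call.
    return x * x


def score_all(values):
    # Placeholder for a thin Java wrapper that loops on the Java side.
    return [score(v) for v in values]


values = list(range(1000))

# Chatty style: one round trip per element.
chatty = FakeBridge()
chatty_result = [chatty.call(score, v) for v in values]

# Batched style: one round trip for the whole list.
batched = FakeBridge()
batched_result = batched.call(score_all, values)

print(chatty.round_trips, batched.round_trips)
```

Both styles produce the same result, but the batched version crosses the bridge once instead of once per element. With a real bridge, each of those crossings would also pay for type conversion and, in Py4J's case, socket communication.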

Tagar · Answer 2 · 2016-04-24T16:45:47+0000

PySpark uses Py4J quite successfully. If all the heavy lifting is done in Spark (or Mahout in your case), and you just want to return the result to the "driver" / Python code, then Py4J may work for you too.

Py4J has somewhat larger overhead for huge results (which is not necessarily the case for Spark workloads, since you only return totals / aggregates for data frames). There is a discussion about improving Py4J by switching to binary serialization, in order to remove this overhead for use cases with higher bandwidth requirements: https://github.com/bartdag/py4j/issues/159

subes · Answer 3 · 2017-06-09T10:27:08+0000

Performance depends on your use case (how often you call a script and how much data is moved), and different solutions have their own specific advantages and disadvantages, so I created an API that lets you switch between different implementations without having to change your Python script: https://github.com/subes/invesdwin-context-python

This makes it easy to test what works best, or simply to stay flexible about what needs to be deployed.

user9962007 · Answer 4 · 2019-05-09T04:58:32+0000

The JPype problem that @HIP_HOP mentioned, where new threads are not attached to the JVM, can be fixed with the following hack (add it before the first call to Java objects in a new thread that is not already attached to the JVM):

    import jpype

    # ensure that current thread is attached to JVM
    # (essential to prevent JVM / entire container crashes
    # due to "JPJavaEnv::FindClass" errors)
    if not jpype.isThreadAttachedToJVM():
        jpype.attachThreadToJVM()
