How Spark interacts with CPython

I have an Akka system written in Scala that needs to call Python code based on Pandas and NumPy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some reusable form.

+7
pandas scala interop apache-spark pyspark
Jun 06 '15 at 16:18
2 answers

The PySpark architecture is described at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.

[Diagram: PySpark internals]

As @Holden said, Spark uses py4j to access Java objects in the JVM from Python. But that is only one case: when the driver program is written in Python (the left side of the diagram).
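To get a feel for that driver-side mechanism, here is a minimal Scala sketch of starting a py4j gateway so a Python process can call into the JVM. Only py4j's GatewayServer is a real API here; the AnalyticsEntryPoint class, its scale method, and the build coordinates comment are illustrative assumptions:

```scala
// Assumes "net.sf.py4j" % "py4j" is on the classpath.
import py4j.GatewayServer

// Hypothetical entry point: its public methods become callable from Python.
class AnalyticsEntryPoint {
  // Trivial example method; Python numbers map to Java primitives via py4j.
  def scale(value: Double, factor: Double): Double = value * factor
}

object GatewayApp {
  def main(args: Array[String]): Unit = {
    // Listens on py4j's default port (25333); a Python client created with
    // py4j's JavaGateway() can then call entry-point methods over the socket.
    val server = new GatewayServer(new AnalyticsEntryPoint)
    server.start()
    println("py4j gateway started")
  }
}
```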

The other case (the right side of the diagram) is when a Spark worker starts a Python process, sends it serialized Java objects for processing, and receives the output back. The Java objects are serialized into pickle format so that Python can read them.
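Here is a minimal, self-contained Scala sketch of that worker-side pattern, under some stated assumptions: the Pyrolite library ("net.razorvine" % "pyrolite", which Spark itself uses for JVM-side pickling) is on the classpath, python3 is on the PATH, and Java 9+ is available for readAllBytes. Spark's actual PythonRDD code exchanges batched, length-prefixed pickle frames over a local socket rather than plain stdin/stdout, so treat this only as an illustration of the idea:

```scala
import java.util.{ArrayList => JArrayList}
import net.razorvine.pickle.{Pickler, Unpickler}

object PythonPipeDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical Python worker: unpickle a list from stdin, double each
    // element, and pickle the result back to stdout.
    val pyScript =
      """import sys, pickle
        |data = pickle.load(sys.stdin.buffer)
        |pickle.dump([2 * x for x in data], sys.stdout.buffer)
        |""".stripMargin

    val proc = new ProcessBuilder("python3", "-c", pyScript).start()

    // Pickle a JVM collection so CPython can read it.
    val payload = new JArrayList[Double]()
    Seq(1.0, 2.0, 3.0).foreach(payload.add)
    val out = proc.getOutputStream
    out.write(new Pickler().dumps(payload))
    out.close()                      // EOF tells the worker its input is complete

    // Read the pickled reply and turn it back into JVM objects.
    val reply = proc.getInputStream.readAllBytes()
    proc.waitFor()
    println(new Unpickler().loads(reply))   // e.g. [2.0, 4.0, 6.0]
  }
}
```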

It sounds like the latter is what you need. Here are some links to the Spark Scala core that may be useful to you:

+9
Jun 07 '15 at 13:47

So, Spark uses py4j to communicate between the JVM and Python. This lets Spark work with different versions of Python, but data transfer requires serializing data on the JVM side and deserializing it on the Python side, and vice versa. For more information about py4j, see http://py4j.sourceforge.net/. Hope this helps :)
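That serialization point matters in practice: py4j hands Python a proxy for JVM collections rather than a copy, so pulling bulk data element by element through the gateway is slow, which is why Spark ships bulk data as pickled batches instead. A hedged Scala sketch of the trap (the class and method names are made up for illustration; only GatewayServer is py4j's real API):

```scala
import java.util.{ArrayList => JArrayList}
import py4j.GatewayServer

class TableEntryPoint {
  // Python receives a JavaList *proxy*, not a copy: each element access is a
  // separate round trip over the gateway socket, so large results should be
  // serialized in bulk (e.g. pickled) rather than read through py4j.
  def loadColumn(n: Int): JArrayList[Double] = {
    val col = new JArrayList[Double]()
    (0 until n).foreach(i => col.add(i.toDouble))
    col
  }
}

object TableGatewayApp {
  def main(args: Array[String]): Unit =
    new GatewayServer(new TableEntryPoint).start()
}
```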

+3
Jun 07 '15 at 7:22


