How Spark interacts with CPython

I have an Akka system written in Scala that needs to call Python code based on Pandas and NumPy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some reusable form.

+7
pandas scala interop apache-spark pyspark
Jun 06 '15 at 16:18
2 answers

The PySpark architecture is described at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.

[Diagram: PySpark internals]

As @Holden said, Spark uses py4j to access Java objects in the JVM from Python. But that is only one case: when the driver program is written in Python (the left side of the diagram).
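To get a feel for that driver-side mechanism, here is a minimal Scala sketch of starting a py4j gateway so a Python process can call into the JVM. Only py4j's GatewayServer is a real API here; the AnalyticsEntryPoint class, its scale method, and the build coordinates comment are illustrative assumptions:

```scala
// Assumes "net.sf.py4j" % "py4j" is on the classpath.
import py4j.GatewayServer

// Hypothetical entry point: its public methods become callable from Python.
class AnalyticsEntryPoint {
  // Trivial example method; Python numbers map to Java primitives via py4j.
  def scale(value: Double, factor: Double): Double = value * factor
}

object GatewayApp {
  def main(args: Array[String]): Unit = {
    // Listens on py4j's default port (25333); a Python client created with
    // py4j's JavaGateway() can then call entry-point methods over the socket.
    val server = new GatewayServer(new AnalyticsEntryPoint)
    server.start()
    println("py4j gateway started")
  }
}
```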

The other case (the right side of the diagram) is when a Spark worker starts a Python process, sends it serialized Java objects for processing, and receives the output back. The Java objects are serialized into pickle format so that Python can read them.
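Here is a minimal, self-contained Scala sketch of that worker-side pattern, under some stated assumptions: the Pyrolite library ("net.razorvine" % "pyrolite", which Spark itself uses for JVM-side pickling) is on the classpath, python3 is on the PATH, and Java 9+ is available for readAllBytes. Spark's actual PythonRDD code exchanges batched, length-prefixed pickle frames over a local socket rather than plain stdin/stdout, so treat this only as an illustration of the idea:

```scala
import java.util.{ArrayList => JArrayList}
import net.razorvine.pickle.{Pickler, Unpickler}

object PythonPipeDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical Python worker: unpickle a list from stdin, double each
    // element, and pickle the result back to stdout.
    val pyScript =
      """import sys, pickle
        |data = pickle.load(sys.stdin.buffer)
        |pickle.dump([2 * x for x in data], sys.stdout.buffer)
        |""".stripMargin

    val proc = new ProcessBuilder("python3", "-c", pyScript).start()

    // Pickle a JVM collection so CPython can read it.
    val payload = new JArrayList[Double]()
    Seq(1.0, 2.0, 3.0).foreach(payload.add)
    val out = proc.getOutputStream
    out.write(new Pickler().dumps(payload))
    out.close()                      // EOF tells the worker its input is complete

    // Read the pickled reply and turn it back into JVM objects.
    val reply = proc.getInputStream.readAllBytes()
    proc.waitFor()
    println(new Unpickler().loads(reply))   // e.g. [2.0, 4.0, 6.0]
  }
}
```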

It sounds like the latter is what you need. Here are some links to the Spark Scala core that may be useful to you:

+9
Jun 07 '15 at 13:47

So, Spark uses py4j to communicate between the JVM and Python. This lets Spark work with different versions of Python, but data transfer requires serializing data on the JVM side and deserializing it on the Python side, and vice versa. For more information about py4j, see http://py4j.sourceforge.net/. Hope this helps :)
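That serialization point matters in practice: py4j hands Python a proxy for JVM collections rather than a copy, so pulling bulk data element by element through the gateway is slow, which is why Spark ships bulk data as pickled batches instead. A hedged Scala sketch of the trap (the class and method names are made up for illustration; only GatewayServer is py4j's real API):

```scala
import java.util.{ArrayList => JArrayList}
import py4j.GatewayServer

class TableEntryPoint {
  // Python receives a JavaList *proxy*, not a copy: each element access is a
  // separate round trip over the gateway socket, so large results should be
  // serialized in bulk (e.g. pickled) rather than read through py4j.
  def loadColumn(n: Int): JArrayList[Double] = {
    val col = new JArrayList[Double]()
    (0 until n).foreach(i => col.add(i.toDouble))
    col
  }
}

object TableGatewayApp {
  def main(args: Array[String]): Unit =
    new GatewayServer(new TableEntryPoint).start()
}
```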

+3
Jun 07 '15 at 7:22


