Spark: the dangers of using Python

In Spark: The Definitive Guide (currently an early-release edition, so the text is subject to change), the authors advise against using PySpark for custom functions in Spark:

“Starting this Python process is expensive, but the real cost is serializing the data to Python. This is expensive for two reasons: it is a costly computation, but also, once the data enters Python, Spark cannot manage the memory of the worker. This means you could potentially cause a worker to fail if it becomes resource constrained (since both the JVM and Python are competing for memory on the same machine).”

I understand that the competition between Python and the JVM for a worker node's resources can be a serious problem. But doesn't the same apply to the driver? In that case it would be an argument against using PySpark at all. Can someone explain what makes the situation different for the driver?

+4
2 answers

If anything, this is more of an argument against using Python UDFs than against PySpark in general, and, to a lesser extent, a similar argument can be made against native (JVM-implemented) UDFs.
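A minimal sketch of that distinction, assuming a local SparkSession and a toy `people` DataFrame (the names here are illustrative, not from the book or the thread): the Python UDF forces every row through a separate Python worker process, while the equivalent built-in function runs entirely inside the JVM, where Spark manages the memory.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
people = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is serialized, shipped to a Python worker
# process on the executor, transformed there, and serialized back.
upper_py = F.udf(lambda s: s.upper(), StringType())
people.select(upper_py("name").alias("name_upper")).show()

# Built-in function: the same logic stays inside the JVM, so Spark
# keeps full control over memory and execution.
people.select(F.upper("name").alias("name_upper")).show()
```

In practice, preferring the built-in pyspark.sql.functions over Python UDFs avoids most of the cost the book describes, without giving up PySpark itself.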

You should also note that vectorized UDFs are on the Spark roadmap, so:

the real cost is serializing the data to Python

may no longer be a problem in the future.
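For reference, a hedged sketch of what that roadmap item later became: vectorized (pandas) UDFs, available since Spark 2.3, exchange data with the JVM in Arrow batches instead of pickling one row at a time. The example assumes pandas and pyarrow are installed; the DataFrame is again a toy one.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
people = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Vectorized UDF: whole pandas Series are exchanged via Arrow,
# which is much cheaper than per-row pickling.
@pandas_udf(StringType())
def upper_vec(s: pd.Series) -> pd.Series:
    return s.str.upper()

people.select(upper_vec("name").alias("name_upper")).show()
```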

But doesn't the same apply to the driver?

Not exactly. Sharing the resources of a single node (driver or worker) is always a concern, but a Python UDF is a worker-side problem.

Moreover, unless you use the RDD API, the data stays in the JVM and is never materialized on the Python side. On the driver, the Python process mostly builds the execution plan, so hardly any data passes through Python there.
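A rough sketch of that contrast (the dataset and names are made up for illustration): the RDD version pipes every record through Python worker processes on the executors, while the DataFrame version is compiled into a plan that the JVM executes on its own; the driver-side Python code only describes the job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "n")

# RDD API: each record is serialized to a Python worker process
# on the executors, transformed there, and serialized back.
doubled_rdd = df.rdd.map(lambda row: row.n * 2)
print(doubled_rdd.take(5))

# DataFrame API: the expression is turned into a plan that runs
# entirely in the JVM; Python only describes the computation.
doubled_df = df.select((F.col("n") * 2).alias("doubled"))
doubled_df.show(5)
```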

+1

The driver only has to hold the data if you call collect (or a similar action); until then the data stays on the workers.

In short: the driver coordinates the job, the workers do the processing.

As long as Spark keeps the data distributed across the executors, the JVM manages it there; the driver's Python process mainly holds the execution plan and small results.

So the memory competition the book warns about is mainly a worker-side concern, not a driver-side one.
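To make the collect point concrete, a small sketch under the same assumptions as above (toy data, illustrative names): an aggregation that stays on the executors versus a collect() that would pull every row back into the driver's Python process.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("driver-demo").getOrCreate()
events = spark.range(10_000_000).withColumn("bucket", F.col("id") % 10)

# Safe: the aggregation runs on the executors and only ten small
# rows are returned to the driver.
summary = events.groupBy("bucket").count()
summary.show()

# Risky: collect() would pull all ten million rows into the driver's
# Python process, where Spark no longer manages the memory.
# rows = events.collect()
```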

0

Source: https://habr.com/ru/post/1686174/

