Spark: the dangers of using Python

In Spark: The Definitive Guide (currently an early-release edition, so the text is subject to change), the authors advise against using PySpark for custom functions in Spark:

“Starting this Python process is expensive, but the real cost is serializing the data to Python. This is expensive for two reasons: it is a costly computation, but also, once the data enters Python, Spark cannot manage the memory of the worker. This means you could potentially cause a worker to fail if it becomes resource constrained (since both the JVM and Python are competing for memory on the same machine).”

I understand that the competition between Python and the JVM for a worker node's resources can be a serious problem. But doesn't the same apply to the driver? In that case it would be an argument against using PySpark at all. Can someone explain what makes the situation different for the driver?

+4
2 answers

If anything, this is more of an argument against using Python UDFs than against PySpark in general, and, to a lesser extent, a similar argument can be made against native (JVM-implemented) UDFs.
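A minimal sketch of that distinction, assuming a local SparkSession and a toy `people` DataFrame (the names here are illustrative, not from the book or the thread): the Python UDF forces every row through a separate Python worker process, while the equivalent built-in function runs entirely inside the JVM, where Spark manages the memory.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
people = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is serialized, shipped to a Python worker
# process on the executor, transformed there, and serialized back.
upper_py = F.udf(lambda s: s.upper(), StringType())
people.select(upper_py("name").alias("name_upper")).show()

# Built-in function: the same logic stays inside the JVM, so Spark
# keeps full control over memory and execution.
people.select(F.upper("name").alias("name_upper")).show()
```

In practice, preferring the built-in pyspark.sql.functions over Python UDFs avoids most of the cost the book describes, without giving up PySpark itself.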

You should also note that vectorized UDFs are on the Spark roadmap, so:

the real cost is serializing the data to Python

may no longer be a problem in the future.
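For reference, a hedged sketch of what that roadmap item later became: vectorized (pandas) UDFs, available since Spark 2.3, exchange data with the JVM in Arrow batches instead of pickling one row at a time. The example assumes pandas and pyarrow are installed; the DataFrame is again a toy one.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
people = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Vectorized UDF: whole pandas Series are exchanged via Arrow,
# which is much cheaper than per-row pickling.
@pandas_udf(StringType())
def upper_vec(s: pd.Series) -> pd.Series:
    return s.str.upper()

people.select(upper_vec("name").alias("name_upper")).show()
```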

But doesn't the same apply to the driver?

Not exactly. Sharing the resources of a single node (driver or worker) is always a concern, but a Python UDF is a worker-side problem.

Moreover, unless you use the RDD API, the data stays in the JVM and is never materialized on the Python side. On the driver, the Python process mostly builds the execution plan, so hardly any data passes through Python there.
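A rough sketch of that contrast (the dataset and names are made up for illustration): the RDD version pipes every record through Python worker processes on the executors, while the DataFrame version is compiled into a plan that the JVM executes on its own; the driver-side Python code only describes the job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "n")

# RDD API: each record is serialized to a Python worker process
# on the executors, transformed there, and serialized back.
doubled_rdd = df.rdd.map(lambda row: row.n * 2)
print(doubled_rdd.take(5))

# DataFrame API: the expression is turned into a plan that runs
# entirely in the JVM; Python only describes the computation.
doubled_df = df.select((F.col("n") * 2).alias("doubled"))
doubled_df.show(5)
```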

+1

The driver only has to hold the data if you call collect (or a similar action); until then the data stays on the workers.

In short: the driver coordinates the job, the workers do the processing.

As long as Spark keeps the data distributed across the executors, the JVM manages it there; the driver's Python process mainly holds the execution plan and small results.

So the memory competition the book warns about is mainly a worker-side concern, not a driver-side one.
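To make the collect point concrete, a small sketch under the same assumptions as above (toy data, illustrative names): an aggregation that stays on the executors versus a collect() that would pull every row back into the driver's Python process.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("driver-demo").getOrCreate()
events = spark.range(10_000_000).withColumn("bucket", F.col("id") % 10)

# Safe: the aggregation runs on the executors and only ten small
# rows are returned to the driver.
summary = events.groupBy("bucket").count()
summary.show()

# Risky: collect() would pull all ten million rows into the driver's
# Python process, where Spark no longer manages the memory.
# rows = events.collect()
```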

0

Source: https://habr.com/ru/post/1686174/

