Spark Python Performance Tuning

I started an IPython notebook for developing with Spark using the following command:

ipython notebook --profile=pyspark 

And I created the SparkContext sc with the following Python code:

    import sys
    import os

    os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
    sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
    sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import *

    sconf = SparkConf()
    conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
            .setAppName("sparkapp1")
            .set("spark.executor.memory", "6g"))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

I want to better understand the spark.executor.memory setting, which the documentation describes as:

Amount of memory to use per executor process, in the same format as JVM memory strings.

Does this mean that the accumulated memory of all processes running on one node will not exceed this cap? If so, should this number be set to the maximum possible number?

There is also a list of other properties; are there any other parameters I can tune from the defaults to improve performance?

Thanks!

2 answers

Does this mean that the accumulated memory of all processes running on one node will not exceed this cap?

Yes, if you use Spark in YARN client mode; otherwise it limits only the JVM.

However, there is one tricky thing about this setting with YARN. YARN limits the accumulated memory to spark.executor.memory, and Spark uses the same limit for the executor JVM; that leaves no memory for Python within the limit, which is why I had to turn the YARN limits off.

As for the honest answer to your question, given your standalone Spark configuration: No, spark.executor.memory does not limit Python's memory allocation.

BTW, setting the option in SparkConf has no effect on Spark standalone executors, as they are already up by the time your application starts. Read more about conf/spark-defaults.conf.
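
For standalone mode, a minimal sketch of the corresponding line in conf/spark-defaults.conf; the 6g value simply mirrors the number from the question rather than being a recommendation:

    # conf/spark-defaults.conf (illustrative value, adjust to your cluster)
    spark.executor.memory   6g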

If so, should this number be set to the maximum possible number?

You should set it to a balanced number. The JVM has a specific behavior here: it will eventually allocate the full spark.executor.memory and never release it. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, since that would hand all of the memory to Java and leave none for Python.

In my environment I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory is used by the Spark cache, 0.4 * spark.executor.memory by the executor JVM itself, and 0.5 * spark.executor.memory by Python.
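
To make that arithmetic concrete, here is a small Python sketch of the sizing rule; the 64 GB node and 4 executors are made-up example figures, not values from the question:

    # Hypothetical node: 64 GB of RAM shared by 4 executors (example numbers only).
    TOTAL_RAM_GB = 64
    EXECUTORS_COUNT = 4

    per_executor_ram = TOTAL_RAM_GB / EXECUTORS_COUNT   # 16 GB available per executor
    executor_memory = per_executor_ram / 1.5             # ~10.7 GB -> spark.executor.memory

    spark_cache   = 0.6 * executor_memory                # ~6.4 GB for the Spark storage cache
    jvm_execution = 0.4 * executor_memory                # ~4.3 GB for the executor JVM itself
    python_memory = 0.5 * executor_memory                # ~5.3 GB headroom for the Python workers

    # The three parts add up to 1.5 * executor_memory = per_executor_ram,
    # so the executor JVM plus its Python processes stay within the node's share.
    print(executor_memory, spark_cache + jvm_execution + python_memory)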

You can also configure spark.storage.memoryFraction, which defaults to 0.6.


Does this mean that the accumulated memory of all processes running on one node will not exceed this cap? If so, should this number be set to the maximum possible number?

Nope. You usually have several executors running on a node, so spark.executor.memory specifies how much memory one executor can take.

You should also check spark.driver.memory and configure it if you expect a significant amount of data to be returned from Spark.

And yes, it partially covers Python memory: the part that gets translated through Py4J and runs inside the JVM.

Spark uses Py4J internally to translate your code into Java and runs it as such. For example, if you define your Spark pipeline as lambda functions on RDDs, that Python code actually runs on the executors through Py4J. On the other hand, if you call rdd.collect() and then work with the result as a local Python variable, that runs through Py4J on your driver.
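
As a small illustration of that split (assuming the sc SparkContext created in the question; the data and lambdas are made up), the mapped lambdas run on the executors while the collected list lives on the driver:

    # Assumes `sc` is the SparkContext from the question; data is made up for illustration.
    rdd = sc.parallelize(range(1, 1001))

    # These lambdas are shipped to the executors: the Python workers on each
    # executor node run them, so they consume executor-side (Python) memory.
    squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

    # collect() pulls the whole result back to the driver as a local Python list,
    # so a large result here is what spark.driver.memory needs to accommodate.
    local_result = squares.collect()
    total = sum(local_result)   # plain Python on the driver from here on
    print(total)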

