How to profile pyspark jobs

Question

I want to understand profiling in PySpark code.

Following this pull request: https://github.com/apache/spark/pull/2351

>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)

The above works fine. But if I do something like the following:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("myapp").set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

df = sqlContext.sql("select * from myhivetable")
df.count()
sc.show_profiles()

It doesn't give me anything. I get the count, but show_profiles() gives me None.

Any help appreciated.

profiler apache-spark pyspark apache-spark-sql spark-dataframe

sau Aug 31 '16 at 15:08

1 answer

user6022341 · Accepted Answer · 2016-08-31T16:00:28+0000

There is no Python code to profile when you use Spark SQL. The only Python involved is the call into the Scala engine; everything else is executed in the Java Virtual Machine.
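To make this concrete: the report format in the question is that of Python's cProfile/pstats, which PySpark's default profiler is built on, and such a profiler can only record Python functions that actually run. Below is a minimal stdlib-only sketch with no Spark involved. The names `to_text`, `python_side_count` and `profile_call` are hypothetical, chosen for illustration; `to_text` plays the role of the `str` passed to `rdd.map` in the question.

```python
import cProfile
import io
import pstats


def to_text(value):
    # Hypothetical per-record function, standing in for the mapped `str`.
    return str(value)


def python_side_count(values):
    # Stand-in for rdd.map(str).count(): Python code runs once per element.
    return sum(1 for _ in map(to_text, values))


def profile_call(func, *args):
    # Run func under cProfile and return (result, report text).
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Same ordering as in the question's report: internal time, cumulative time.
    stats.sort_stats("time", "cumulative").print_stats()
    return result, buffer.getvalue()


result, report = profile_call(python_side_count, range(100))
print(result)
print(report)
```

The report lists `to_text` because it is Python code that ran under the profiler. With `sqlContext.sql(...).count()` no comparable Python function runs per record, so there is nothing to list. As an untested assumption, forcing Python work, for example `df.rdd.map(lambda row: row).count()`, should make `sc.show_profiles()` print a profile again, at the cost of moving the data through Python.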
