Transformed DStream in PySpark gives an error when calling pprint

I'm learning Spark Streaming through PySpark and hit an error when I try to use transform together with take.

I can successfully use sortBy on a DStream through transform and pprint the result:

author_counts_sorted_dstream = author_counts_dstream.transform(
    lambda foo: foo
        .sortBy(lambda x: x[0].lower())
        .sortBy(lambda x: x[1], ascending=False))
author_counts_sorted_dstream.pprint()

But if I follow the same pattern with take and try to pprint the result:

top_five = author_counts_sorted_dstream.transform(
    lambda rdd: rdd.take(5))
top_five.pprint()

the job fails with:

Py4JJavaError: An error occurred while calling o25.awaitTermination.
: org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/streaming/util.py", line 67, in call
    return r._jrdd
AttributeError: 'list' object has no attribute '_jrdd'

You can see the full code in a notebook here.

What am I doing wrong?

1 answer

The function you pass to transform has to map an RDD to another RDD. If you use an action, such as take, you have to parallelize the result back into an RDD:

sc: SparkContext = ...

author_counts_sorted_dstream.transform(
  lambda rdd: sc.parallelize(rdd.take(5))
)
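The error makes sense once you note that take is an action: it returns a plain Python list, not an RDD, and the streaming machinery then fails when it looks for the list's _jrdd attribute. A local sketch of that type mismatch (no Spark needed; the slice is just a stand-in for what rdd.take(5) returns):

```python
# rdd.take(5) returns an ordinary Python list, not an RDD,
# so it has no _jrdd attribute -- exactly what the traceback reports.
rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5), ("f", 6)]
taken = rows[:5]  # local stand-in for rdd.take(5)

print(isinstance(taken, list))   # True
print(hasattr(taken, "_jrdd"))   # False
```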

RDD.sortBy, by contrast, is a transformation (it returns an RDD), so its result can be pprinted directly.

On a side note, the following function:

lambda foo: foo \
    .sortBy(lambda x:x[0].lower()) \
    .sortBy(lambda x:x[1], ascending=False)

doesn't do what you probably intend. Since Spark sorts by shuffle, the sort is not stable, so the second sortBy can scramble the order established by the first. If you want to sort by multiple fields, use a composite key instead:

lambda x: (x[0].lower(), -x[1])
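To see what the composite key buys you, here is a plain-Python sketch of the same key function using the built-in sorted (the sample rows are made up for illustration):

```python
rows = [("Bob", 3), ("alice", 5), ("Alice", 5), ("carol", 1)]

# One pass, one key: case-insensitive name ascending, with the
# negated count breaking ties in descending order of count.
result = sorted(rows, key=lambda x: (x[0].lower(), -x[1]))

print(result)
# [('alice', 5), ('Alice', 5), ('Bob', 3), ('carol', 1)]
```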

Source: https://habr.com/ru/post/1570125/

