Checkpoint RDD ReliableCheckpointRDD has a different number of partitions from the original RDD

Question


I have a Spark cluster of two machines, and when I launch a Spark Streaming application I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2)
    at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
    at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)

How can I provide a checkpoint directory on a file system that is not HDFS / Cassandra / any other data store?

I thought of two possible solutions, but I don't know how to implement them:

Have one remote directory that is mounted locally on both workers.
Specify the remote directory directly to both workers.

Any suggestions?


apache-spark apache-spark-ml spark-streaming

Soumitra Oct 20 '15 at 2:04


1 answer

Soumitra · Accepted Answer · 2015-10-20T14:55:36+0000

OK, so I was able to get the first option working.

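How to mount the remote checkpoint directory on each worker:

Install sshfs:

sudo apt-get install sshfs

Load the FUSE module into the kernel:

sudo modprobe fuse

Add your user to the fuse group (replace username with your own user name):

sudo adduser username fuse

Create the local mount point:

mkdir ~/checkpoint

Mount the remote directory over SSH:

sshfs ubuntu@xx.xx.x.xx:/home/ubuntu/checkpoint ~/checkpoint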
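
After mounting, it may help to confirm on each worker that the checkpoint directory exists and is writable before starting the streaming job. The following is only a minimal sketch: `check_checkpoint_dir` is a hypothetical helper name (not part of Spark or sshfs), and it assumes the mount point is `~/checkpoint` as above. It checks write access only. It does not verify that the directory is actually an sshfs mount.

```shell
#!/bin/sh
# Sketch: verify that a checkpoint directory is usable on this machine.
# check_checkpoint_dir is a hypothetical helper, not part of Spark or sshfs.
check_checkpoint_dir() {
    dir="$1"

    # The directory must exist.
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi

    # Create and remove a probe file to confirm write access.
    probe="$dir/.probe.$$"
    if ! touch "$probe" 2>/dev/null; then
        echo "not writable: $dir"
        return 1
    fi
    rm -f "$probe"

    echo "ok: $dir"
}

# Example (run on each worker after mounting):
# check_checkpoint_dir "$HOME/checkpoint"
```

If the function reports `missing` or `not writable` on any worker, the mount on that machine is the first thing to inspect.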
