Checkpoint RDD ReliableCheckpointRDD has a different number of partitions from the original RDD

I have a spark cluster of two machines, and when I launch the spark streaming application, I get the following errors:

Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2)
    at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
    at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)

How can I provide a checkpoint directory on a file system that is not HDFS / Cassandra / any other data store?

I thought of two possible solutions, but I don't know how to encode them:

  • there is one remote directory that is local for both workers

  • specify the remote directory as for workers

Any suggestions?

+4
source share
1 answer

Good, so I was able to move on to the first option.

, .

How to mount the remote checkpoint directory on the workers:

sudo apt-get install sshfs
Load it to kernel

sudo modprobe fuse

sudo adduser username fuse

mkdir ~/checkpoint

sshfs ubuntu@xx.xx.x.xx:/home/ubuntu/checkpoint ~/checkpoint
+2

Source: https://habr.com/ru/post/1612439/


All Articles