What makes an FS "Hadoop compatible"? It is a filesystem with the behavior specified in the Hadoop FS specification. The difference between an object store and a filesystem is covered there, the key point being "eventually consistent object stores without append or O(1) atomic renames are not compliant".
For S3 in particular:
- it is not consistent: after a new blob is created, list commands often don't show it; the same goes for deletes.
- when a blob is overwritten or deleted, it can take a while for the change to become visible.
- rename() is implemented as a copy followed by a delete (a rough timing sketch follows this list).
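To make that last point concrete, here is a minimal sketch using the Hadoop FileSystem API to time a rename against an s3a:// path. The bucket and paths are hypothetical; the only point is that rename() against S3 is O(bytes copied), not O(1):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object RenameTiming {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical bucket/paths; s3a is the Hadoop S3 connector
    val fs  = FileSystem.get(new URI("s3a://my-bucket/"), conf)
    val src = new Path("s3a://my-bucket/tmp/checkpoint-in-progress")
    val dst = new Path("s3a://my-bucket/checkpoints/checkpoint-latest")

    val start = System.nanoTime()
    // On HDFS this is a near-instant metadata operation; on S3 the
    // connector emulates it with a server-side COPY of every byte
    // followed by a DELETE, so the time grows with the data size.
    val renamed = fs.rename(src, dst)
    val secs = (System.nanoTime() - start) / 1e9
    println(f"rename returned $renamed after $secs%.1f s")
  }
}
```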
Spark Streaming checkpoints by saving everything to a temporary location and then renaming it into the checkpoint directory. This makes the time to checkpoint proportional to the time it takes to copy the data within S3, which is roughly 6-10 MB/s.
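As a back-of-envelope example (the 2 GB state size is an assumption; the 6-10 MB/s figure is the one above), the copy alone dominates the checkpoint time:

```scala
// Rough arithmetic: checkpoint duration ≈ state size / S3 copy throughput.
object CheckpointCost {
  def main(args: Array[String]): Unit = {
    val stateBytes      = 2L * 1024 * 1024 * 1024   // assume a 2 GB checkpoint
    val copyBytesPerSec = 8.0 * 1024 * 1024         // middle of the 6-10 MB/s range
    val secs = stateBytes / copyBytesPerSec
    println(f"~2 GB at ~8 MB/s takes about $secs%.0f s (~${secs / 60}%.1f min)")
  }
}
```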
The current bit of streaming code isn't well suited to S3, and while I may fix it at some point, I'm not going to submit any new patches until my old ones are taken in. There is no point in me working on this if it just gets ignored.
For now, do one of the following:
- checkpoint to HDFS and then copy the results over
- checkpoint to an EBS volume allocated and attached to your cluster
- checkpoint to S3, but with a long gap between checkpoints, so that the time to checkpoint doesn't bring your streaming application down (see the sketch after this list)
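A minimal Spark Streaming sketch of those options; the application name, host, bucket, and the 10-minute interval are illustrative assumptions, not values from the answer:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object CheckpointOptions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-demo")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Option 1: checkpoint to HDFS (O(1) rename), copy results out afterwards:
    //   ssc.checkpoint("hdfs://namenode:8020/checkpoints/app")
    // Option 2: checkpoint to a filesystem on an attached EBS volume:
    //   ssc.checkpoint("file:///mnt/ebs/checkpoints/app")
    // Option 3: keep S3, but stretch the checkpoint interval:
    ssc.checkpoint("s3a://my-bucket/checkpoints/app")

    val words  = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val counts = words.map((_, 1L)).reduceByKeyAndWindow(_ + _, Minutes(30), Seconds(10))
    counts.checkpoint(Minutes(10)) // checkpoint this DStream only every 10 minutes
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```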
If you use EMR, you can pay a premium for the consistent, DynamoDB-backed view of S3, which gives you better consistency. But the copy time is the same, so checkpointing will be just as slow.