How does DStream partitioning work (for updateStateByKey()), and how can I verify it?

I use the updateStateByKey() operation to maintain state in a Spark Streaming application. The input comes from a Kafka topic.
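For context, here is a minimal sketch of that setup, assuming the spark-streaming-kafka-0-10 direct-stream API; the topic name, broker address, group id, and checkpoint path are placeholders, and records are assumed to have non-null string keys:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf().setAppName("StatefulStreamingApp")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoints") // checkpointing is required for stateful operations

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "state-demo"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)

// Running count per key, carried across batches.
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val counts = stream
  .map(record => (record.key, 1)) // assumes non-null string keys
  .updateStateByKey(updateFunc)

counts.print()
ssc.start()
ssc.awaitTermination()
```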

  • How are DStreams partitioned?
  • How does partitioning work with mapWithState() or updateStateByKey()?
  • In updateStateByKey(), are the old state and the new values for a given key processed on the same node?
  • How often does a shuffle occur for updateStateByKey()?

The state I need to maintain contains ~100,000 keys, and I want to avoid shuffling it every time the state is updated. Any hints?
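As a hedged sketch of one option: updateStateByKey() has an overload that takes an explicit Partitioner, which keeps the state RDD's partitioning stable across batches and lets you choose the partition count. Note this does not remove the shuffle of each batch's new records into the state's partitions (see the answer below). Reusing `stream` and `updateFunc` from the sketch above; the partition count 32 is an arbitrary example value:

```scala
import org.apache.spark.HashPartitioner

// Fix the partitioner (and partition count) once so every batch's state
// RDD keeps the same partitioning; 32 is an arbitrary example value.
val partitioner = new HashPartitioner(32)

val counts = stream
  .map(record => (record.key, 1))
  .updateStateByKey(updateFunc, partitioner)
```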


Here is a link to Tathagata Das's answer to the same question on the Spark user mailing list:

https://www.mail-archive.com/user@spark.apache.org/msg43512.html

Here is the relevant text:

Both mapWithState() and updateStateByKey() use the HashPartitioner by default, hashing the key of the key-value DStream on which the stateful operation is applied. The new data and the state are partitioned with the exact same partitioner, so the new data for the same keys (from the input DStream) is shuffled and colocated with the already-partitioned state RDDs. In this way the new data is brought to the corresponding old state on the same machine, and then the state mapping/update function is applied.
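To verify this partitioning yourself (the "how to check it" part of the question), one straightforward approach is to inspect the partitioner of the resulting state RDDs as batches complete; a short sketch, reusing the `counts` DStream from the question:

```scala
// Print the partitioner and partition count of each state RDD.
// With updateStateByKey defaults this should report a HashPartitioner.
counts.foreachRDD { rdd =>
  println(s"partitioner = ${rdd.partitioner}, partitions = ${rdd.getNumPartitions}")
}
```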



Source: https://habr.com/ru/post/1621569/

