Partitioning of RDDs in Spark Streaming

Spark Streaming batches are batches of RDDs. Say I have a batch of 3 RDDs.

In addition, the Spark documentation says that a block is created by the receiver every 200 ms, and a partition is allocated for each block.

So, say, after 1 second I have a batch of 3 RDDs with 5 blocks each, if the 200 ms interval is taken into account.

My question is how the RDDs will be distributed across the worker nodes: will a single RDD be partitioned, or the batch as a whole?

I may have got this wrong. Please guide me.

+5
2 answers

One streaming batch corresponds to one RDD. That RDD will have n partitions, where n = batch interval / block interval. Say you have the default block interval of 200 ms and a batch interval of 2 seconds: then you will have 10 partitions. Blocks are created by a receiver, and each receiver runs on one host. So these 10 partitions sit on a single node and are replicated to a second node.
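A minimal sketch of that arithmetic (the app name, host, and port are placeholders, not from the answer): with a 2-second batch interval and the default 200 ms `spark.streaming.blockInterval`, each batch RDD should report 2000 / 200 = 10 partitions, assuming data arrives throughout the interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BlockIntervalDemo extends App {
  val conf = new SparkConf()
    .setAppName("BlockIntervalDemo")               // hypothetical app name
    .setMaster("local[2]")                          // 1 core for the receiver, 1 for tasks
    .set("spark.streaming.blockInterval", "200ms")  // the default value, set explicitly
  val ssc = new StreamingContext(conf, Seconds(2))  // 2-second batch interval

  // Hypothetical socket source; any receiver-based source behaves the same.
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.foreachRDD { rdd =>
    // batch interval / block interval = 2000 ms / 200 ms = 10 partitions
    println(s"partitions in this batch: ${rdd.getNumPartitions}")
  }

  ssc.start()
  ssc.awaitTermination()
}
```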

When the RDD is processed, the tasks will read the data from that host. Tasks executed on the same node have NODE_LOCAL locality, while tasks executed on other nodes have ANY locality and will take longer.

Therefore, to improve parallelism, it is recommended to create several receivers and union them into a single DStream for further processing, as in the sketch below. That way the data is consumed and processed by several nodes in parallel.
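A sketch of that multiple-receiver pattern (the receiver count, ports, and word-count pipeline are illustrative assumptions); `ssc.union` merges the per-receiver DStreams into one:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object MultiReceiverDemo extends App {
  val conf = new SparkConf()
    .setAppName("MultiReceiverDemo")
    .setMaster("local[4]") // enough cores for 3 receivers plus processing
  val ssc = new StreamingContext(conf, Seconds(2))

  // Three receivers, each occupying one core, ideally on different executors.
  // Ports are hypothetical; in practice these would be distinct sources/hosts.
  val streams = (1 to 3).map(i => ssc.socketTextStream("localhost", 9998 + i))

  // One DStream for further processing; its RDDs carry the union of the
  // partitions produced by all three receivers.
  val unified: DStream[String] = ssc.union(streams)
  unified.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

  ssc.start()
  ssc.awaitTermination()
}
```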

+14

Is this still applicable in newer versions of Spark?

I read an article saying that the multiple-receiver approach in Spark is outdated, and that the new direct Kafka API ( createDirectStream ) instead takes care of much of this for you.
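For reference, a minimal sketch of the direct Kafka API mentioned there, using the spark-streaming-kafka-0-10 integration (broker address, topic, and group id are placeholder values). There are no receivers here: each Kafka partition maps 1:1 to an RDD partition, so the union-of-receivers trick above becomes unnecessary.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectKafkaDemo extends App {
  val ssc = new StreamingContext(
    new SparkConf().setAppName("DirectKafkaDemo"), Seconds(2))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers"  -> "localhost:9092",        // placeholder broker
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id"           -> "demo-group"             // placeholder group id
  )

  // No receiver: each Kafka partition of "demo-topic" becomes one RDD
  // partition, consumed in parallel by the executors.
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))
  stream.map(record => record.value).print()

  ssc.start()
  ssc.awaitTermination()
}
```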

0

Source: https://habr.com/ru/post/1233133/

