One streaming batch corresponds to one RDD. That RDD has n partitions, where n = batch interval / block interval. For example, with the default block interval of 200 ms and a batch interval of 2 seconds, you get 10 partitions. Blocks are created by a receiver, and each receiver is allocated on a single host. So those 10 partitions reside on one node and are replicated to a second node.
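The partition arithmetic above can be sketched in a few lines of Python. The function name `partitions_per_batch` is made up for illustration; the 200 ms default matches Spark's `spark.streaming.blockInterval` setting. The result is per receiver and assumes data arrives during every block interval, so treat it as an estimate rather than a guarantee.

```python
def partitions_per_batch(batch_interval_ms: int, block_interval_ms: int = 200) -> int:
    """Estimate how many blocks (partitions) one receiver produces per batch.

    Hypothetical helper: it only mirrors n = batch interval / block interval.
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    # Integer division: a block interval that does not fit completely is ignored here.
    return batch_interval_ms // block_interval_ms


# 2-second batches with the default 200 ms block interval
print(partitions_per_batch(2000))  # 10
```

Lowering the block interval raises the partition count per batch, but all of those partitions still originate on the receiver's host.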
When the RDD is submitted for processing, the hosts running the tasks read the data from that host. Tasks executing on the same node have "NODE_LOCAL" locality, while tasks executing on other nodes have "ANY" locality and take longer.
Therefore, to improve parallel processing, it is recommended to allocate several receivers and use union to create a single DStream for further processing. That way, the data is consumed and processed by several nodes in parallel.
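A minimal PySpark sketch of the several-receivers-plus-union approach might look like the following. The helper name `union_of_receivers`, the socket sources, and the host names `source-1` to `source-3` are assumptions for illustration only; substitute whatever receiver-based input you actually use. `socketTextStream` and `StreamingContext.union` are the standard DStream API calls.

```python
def union_of_receivers(ssc, endpoints):
    """Create one socket receiver per (host, port) pair and union them into one DStream.

    Hypothetical helper; `ssc` is expected to be a pyspark StreamingContext.
    """
    if not endpoints:
        raise ValueError("need at least one endpoint")
    # Each input DStream gets its own receiver, which Spark can place on a different executor.
    streams = [ssc.socketTextStream(host, port) for host, port in endpoints]
    # Merge the per-receiver streams into a single DStream for downstream processing.
    return ssc.union(*streams)


def main():
    # Imported here so the helper above can be read and tried without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="MultiReceiverSketch")
    ssc = StreamingContext(sc, 2)  # 2-second batch interval

    # Assumed data sources; these host names are placeholders.
    endpoints = [("source-1", 9999), ("source-2", 9999), ("source-3", 9999)]
    lines = union_of_receivers(ssc, endpoints)
    lines.count().pprint()  # print the number of records in each batch

    ssc.start()
    ssc.awaitTermination()
```

To run it, call `main()` from a script launched with `spark-submit`. Each receiver occupies one core for as long as the job runs, so the application needs more cores than receivers or no capacity is left for processing.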
maasg