The Flink stream processing model is very different from Spark Streaming, which is centered around mini-batches. In Spark Streaming, each mini-batch runs as a regular batch program on a finite data set, while Flink DataStream programs continuously process records.
In the Flink DataSet API, a MapPartitionFunction has two parameters: an iterator over the input and a collector for the function's results. A MapPartitionFunction in a Flink DataStream program would never return from its first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal streaming model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
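To illustrate why this cannot work on an unbounded stream, here is a minimal plain-Java sketch (no Flink dependency; `mapPartition` and the doubling transformation are hypothetical stand-ins) of a mapPartition-style function that must drain its iterator before it can return:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class MapPartitionSketch {
    // Hypothetical mapPartition-style function: it consumes the ENTIRE
    // iterator before returning. On a bounded DataSet this terminates;
    // on an unbounded stream, hasNext() is never false, so the function
    // never returns and the runtime never gets a chance to checkpoint.
    static List<Integer> mapPartition(Iterator<Integer> records) {
        List<Integer> out = new ArrayList<>();
        while (records.hasNext()) {       // never false for an endless stream
            out.add(records.next() * 2);  // example transformation: double each record
        }
        return out;                       // unreachable for unbounded input
    }

    public static void main(String[] args) {
        // With a finite (DataSet-like) input, the call returns normally.
        List<Integer> bounded = Arrays.asList(1, 2, 3);
        System.out.println(mapPartition(bounded.iterator())); // prints [2, 4, 6]
    }
}
```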
To implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize a stream in a way that is somewhat similar to mini-batches, but windows offer much more flexibility. Since a window has a finite size, you can apply a reduce function to it.
It might look like this:
yourStream.keyBy("myKey")            // organize stream by key "myKey"
  .timeWindow(Time.seconds(5))       // build 5 sec tumbling windows
  .reduce(new YourReduceFunction()); // apply a reduce function on each window
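To show what this pipeline computes, here is a minimal plain-Java simulation (no Flink dependency; the `Event` record, `windowedReduce` helper, and all names are illustrative assumptions, not Flink API): records are grouped by key, assigned to tumbling windows by their timestamp, and each window's values are reduced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

public class TumblingWindowSketch {
    // A record with a key, an event timestamp (ms), and a value.
    record Event(String key, long timestampMs, int value) {}

    // Hypothetical in-memory simulation of keyBy + tumbling window + reduce:
    // records are grouped by (key, window index) and each group is reduced.
    static Map<String, List<Integer>> windowedReduce(
            List<Event> events, long windowMs, BinaryOperator<Integer> reducer) {
        Map<String, Map<Long, Integer>> state = new HashMap<>();
        for (Event e : events) {
            long window = e.timestampMs() / windowMs; // tumbling window index
            state.computeIfAbsent(e.key(), k -> new TreeMap<>())
                 .merge(window, e.value(), reducer);  // reduce within the window
        }
        // Emit one reduced result per window, in window order, for each key.
        Map<String, List<Integer>> out = new HashMap<>();
        state.forEach((k, windows) -> out.put(k, new ArrayList<>(windows.values())));
        return out;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("a", 1000, 1), new Event("a", 4000, 2), // "a", window 0
            new Event("a", 6000, 3),                          // "a", window 1
            new Event("b", 2000, 10));                        // "b", window 0
        // 5-second tumbling windows, summing values per key and window.
        System.out.println(windowedReduce(events, 5000, Integer::sum));
    }
}
```

Note the key property the text describes: because each window is finite, the reduce for a window can complete and emit a result, unlike a reduce over the whole unbounded stream.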
The DataStream documentation shows how to define different types of windows and explains all the available functions.
Note: The DataStream API was redesigned recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the coming days.