How to get the current timestamp in a Spark stream

How do I get the current batch timestamp of a DStream in a Spark Streaming job?

I have a Spark Streaming application in which the input goes through many transformations.

At runtime I need the current batch timestamp to compare against a timestamp carried in the input records.

If I simply compare against the current system time, the value may differ between executions of the RDD transformation.

Is there a way to get the timestamp at which a specific Spark micro-batch is running, or the interval at which it is scheduled?

3 answers
dstream.foreachRDD((rdd, time) => {
  // `time` is the scheduler's batch time for this micro-batch job;
  // consecutive values differ by your window/slide (batch) interval.
})
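A minimal sketch of how that batch time could be used to answer the question, i.e. to compare each record's own timestamp with the time of the micro-batch currently running. The `Event` case class, its `eventTs` field, and the one-minute lag tolerance are assumptions for illustration, not part of the original answer:

```scala
// Sketch: drop records whose embedded timestamp lags the current
// batch time by more than an assumed tolerance.
case class Event(eventTs: Long, payload: String)

val maxLagMs = 60 * 1000L  // assumed tolerance of one minute

// `dstream` is assumed to be a DStream[Event]
dstream.foreachRDD { (rdd, time) =>
  val batchMs = time.milliseconds  // identical for every task in this batch
  val fresh = rdd.filter(_.eventTs >= batchMs - maxLagMs)
  // ... process `fresh` ...
}
```

Because `time` is assigned by the scheduler once per batch, every task in the batch sees the same value, which avoids the drift you would get from calling the system clock inside each task.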
dstream.transform(
    (rdd, time) => {
        // Tag every record with the batch time so downstream
        // stages can read it without calling the system clock.
        rdd.map((time, _))
    }
).filter(...)
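After this `transform`, each element is a `(Time, record)` pair in which the batch time is identical for all records of one micro-batch. A hedged example of consuming the pairs downstream; the `eventTs` field and the 5-second threshold are assumptions for illustration:

```scala
// Each element is now (batchTime, record). Keep only records whose
// own (hypothetical) `eventTs` field is within 5 seconds of the
// batch time that tagged them.
val tagged = dstream.transform((rdd, time) => rdd.map((time, _)))

val recent = tagged.filter { case (batchTime, record) =>
  math.abs(batchTime.milliseconds - record.eventTs) <= 5000L
}
```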

Late answer, but if it helps anyone: the batch time can be retrieved in milliseconds. First, define a helper that formats it using the Java API:

Using the Java 7-era java.util.Date/SimpleDateFormat style:

import java.util.Date
import java.text.SimpleDateFormat

def returnFormattedTime(ts: Long): String = {
    val date = new Date(ts)
    val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    formatter.format(date)
}

Or, using the Java 8 java.time API:

import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

def returnFormattedTime(ts: Long): String = {
    val date = Instant.ofEpochMilli(ts)
    val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.systemDefault())
    formatter.format(date)
}
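For a quick sanity check, the same formatting logic can be exercised with a known epoch value. Pinning the zone to UTC (my variation, not part of the original answer) makes the output deterministic instead of depending on the machine's default time zone:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Same formatting logic as above, but pinned to UTC so the result
// does not vary with the machine's default time zone.
def formatUtc(ts: Long): String =
  DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(ZoneOffset.UTC)
    .format(Instant.ofEpochMilli(ts))

// formatUtc(0L) yields "1970-01-01 00:00:00"
```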

Finally, use the foreachRDD method to get the batch timestamp:

dstreamIns.foreachRDD { (rdd, time) =>
    ....
    println(s"${returnFormattedTime(time.milliseconds)}")
    ....
}

Source: https://habr.com/ru/post/1621467/
