Can I keep millions of keys in Spark Streaming state for two months?

I am trying to solve a problem (simplified here) in Spark Streaming: let's say I have a log of events made by users, where each event is a tuple (username, activity, time), for example:

("user1", "view", "2015-04-14T21:04Z") ("user1", "click", "2015-04-14T21:05Z") 

Now I would like to collect the events per user in order to run some analysis on them. Let's say the output is the per-user list of events:

 ("user1", List(("view", "2015-04-14T21:04Z"),("click", "2015-04-14T21:05Z")) 

Events need to be stored for as long as 2 months. During this time there may be about 500 million such events and millions of unique users, which are the keys here.

My questions:

  • Is it possible to accomplish this with updateStateByKey on a DStream when I have millions of keys? (see the sketch after this list for what I have in mind)
  • Am I right that DStream.window is useless here, when the window length is 2 months and the slide interval is a few seconds?
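For the first question, here is a minimal sketch of the kind of state update I have in mind (the checkpoint directory and the stream source are placeholders, not part of any real setup):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    type Event = (String, String)  // (activity, time)

    val conf = new SparkConf().setAppName("user-events")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints")  // stateful DStreams require checkpointing

    // events: ("user1", ("view", "2015-04-14T21:04Z")) pairs; the source is omitted here
    val events: DStream[(String, Event)] = ???

    // Append each batch's new events to the full per-user history kept in state.
    val histories: DStream[(String, List[Event])] =
      events.updateStateByKey[List[Event]] {
        (newEvents: Seq[Event], state: Option[List[Event]]) =>
          Some(state.getOrElse(Nil) ++ newEvents)
      }

This is exactly the state that worries me: the history list per user only grows for two months, and the update function runs for every key in every batch.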

PS I found out that updateStateByKey is called for all keys on each slide, which means it would be called millions of times every few seconds. This makes me doubt the design, so I am rather considering alternative solutions, for example:

  • using Cassandra for state
  • using Trident state (possibly with Cassandra)
  • using Samza with its state management
2 answers

I think it depends on how you will query the data later on. I had a similar scenario; I just did the transformation with mapPartitions and reduceByKey and saved the data to Cassandra.
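Roughly, the per-batch write looked like the sketch below; the keyspace, table, and column names are made up for illustration, and it assumes the DataStax spark-cassandra-connector is available:

    import com.datastax.spark.connector._   // spark-cassandra-connector
    import org.apache.spark.streaming.dstream.DStream

    // events: ("user1", ("view", "2015-04-14T21:04Z")) pairs, as in the question
    def persistPerBatch(events: DStream[(String, (String, String))]): Unit =
      events.foreachRDD { (rdd, batchTime) =>
        rdd
          // per-partition preprocessing (parsing, filtering) before the shuffle
          .mapPartitions(_.map { case (user, ev) => (user, List(ev)) })
          // aggregate each user's events within this batch only
          .reduceByKey(_ ++ _)
          .map { case (user, evs) =>
            (user, batchTime.milliseconds, evs.map { case (a, t) => s"$a@$t" }.mkString(","))
          }
          // one row per user per batch; later reads merge across batches
          .saveToCassandra("analytics", "user_events", SomeColumns("username", "batch_ms", "events"))
      }

With the per-batch results in Cassandra, the two-month analysis becomes an ordinary query over the table, and Spark Streaming itself holds no long-lived state.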


It looks like you need an exponentially decaying window.

You can read about them in Rajaraman, Anand, and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press, 2011; see section 4.7, and especially section 4.7.3, for implementation details.
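For concreteness, here is one way such a decayed aggregate could be kept per user with updateStateByKey; the decay constant c and the use of a plain count are my own assumptions rather than anything from the question:

    import org.apache.spark.streaming.dstream.DStream

    // Exponentially decaying count per key, in the spirit of MMDS section 4.7.3:
    // each arriving element multiplies the running sum by (1 - c) and then adds
    // its own value, so old events fade away instead of being stored for 2 months.
    def decayedCounts(events: DStream[(String, Double)], c: Double = 1e-6): DStream[(String, Double)] =
      events.updateStateByKey[Double] { (values: Seq[Double], state: Option[Double]) =>
        Some(values.foldLeft(state.getOrElse(0.0)) { (sum, v) => sum * (1 - c) + v })
      }

The appeal is that the state per key collapses to a single number, so even millions of keys are cheap to hold; updateStateByKey is still invoked for every key in every batch, though.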


Source: https://habr.com/ru/post/985137/

