What is the limit of Spark Streaming in terms of data volume?

I have tens of millions of rows of data. Is it possible to analyze all of it within a week or a day using Spark Streaming? What is the limit of Spark Streaming in terms of data volume? I'm not sure what the upper limit is and at what point I should write the results to my database, since the stream probably can't hold everything. I also have different time windows of 1, 3, 6 hours, etc., where I use window operations to separate the data.

Please find my code below:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# appname, read_json, categoryTransform, TransformInData, axesTransformData,
# joinstream and axestrans are defined elsewhere in my project.
conf = SparkConf().setAppName(appname)
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 300)  # 300-second batch interval
sqlContext = SQLContext(sc)

# sc.cassandraTable comes from the pyspark-cassandra connector
channels = sc.cassandraTable("abc", "channels")

topic = 'abc.crawled_articles'
kafkaParams = {"metadata.broker.list": "0.0.0.0:9092"}

# categories
category = 'abc.crawled_article'
category_stream = KafkaUtils.createDirectStream(ssc, [category], kafkaParams)
category_join_stream = (category_stream
    .map(lambda x: read_json(x[1])).filter(lambda x: x != 0)
    .map(lambda x: categoryTransform(x)).filter(lambda x: x != 0)
    .map(lambda x: (x['id'], x)))

# articles
article_stream = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams)
article_join_stream = (article_stream
    .map(lambda x: read_json(x[1])).filter(lambda x: x != 0)
    .map(lambda x: TransformInData(x)).filter(lambda x: x != 0)
    .flatMap(lambda x: (a for a in x))
    .map(lambda x: (x['id'].encode("utf-8"), x)))

# axes topic: integrates the articles and the axes
axes_topic = 'abc.crawled_axes'
axes_stream = KafkaUtils.createDirectStream(ssc, [axes_topic], kafkaParams)
axes_join_stream = (axes_stream
    .filter(lambda x: 'delete' not in str(x))
    .map(lambda x: read_json(x[1])).filter(lambda x: x != 0)
    .map(lambda x: axesTransformData(x)).filter(lambda x: x != 0)
    .map(lambda x: (str(x['id']), x))
    .map(lambda x: (x[0], {'id': x[0], 'attitudes': x[1]['likes'],
                           'reposts': 0, 'comments': x[1]['comments'],
                           'speed': x[1]['comments']})))
# axes_join_stream.reduceByKeyAndWindow(lambda x, y: x + y, 30, 10).transform(axestrans).pprint()

# join
statistics = (article_join_stream.window(1 * 60 * 60, 5 * 60)
    .cogroup(category_join_stream.window(1 * 60 * 60, 60))
    .cogroup(axes_join_stream.window(24 * 60 * 60, 5 * 60)))
statistics.transform(joinstream).pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()
1 answer

One at a time:

  • Is it possible to process [tens of millions of rows] within [a given amount of time]?

As a rule, yes: Spark lets you scale out across many machines, so in principle you should be able to spin up a large cluster and crunch a lot of data in a relatively short time (assuming we are talking about hours or days, not seconds or less, which could be problematic due to overhead).

In particular, running the kind of processing illustrated in your question on tens of millions of records looks feasible in a reasonable amount of time (i.e. without needing an extremely large cluster).

  • What is the limit of Spark Streaming in terms of data volume?

I don't know, but you will have a hard time reaching it. There are examples of extremely large deployments, e.g. at eBay ("hundreds of metrics averaging 30 TB per day"). Also see the Spark FAQ, which mentions clusters of 8,000 machines and processing of petabytes of data.

  • When should the results be written to [some store]?

Following Spark Streaming's basic model, data is processed in micro-batches. If your data really is a stream (i.e. it has no definite end), then the simplest approach is to store the processing results of each micro-batch (i.e. of each RDD), as sketched below.
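A minimal sketch of that idea, assuming a hypothetical save_partition() helper that writes an iterator of records to your store (Cassandra, a relational database, etc.); the statistics and joinstream names are taken from the code in your question:

def save_rdd(time, rdd):
    # Persist each micro-batch as soon as it is produced; skip empty batches.
    if not rdd.isEmpty():
        rdd.foreachPartition(save_partition)  # save_partition is a hypothetical helper

statistics.transform(joinstream).foreachRDD(save_rdd)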

If your data is NOT a stream, e.g. you process a bunch of static files from time to time, you should probably consider dropping the streaming part altogether (e.g. using just Spark as a batch processor).

Since your question mentions window sizes of several hours, I suspect you might want to consider the batch option; a rough sketch follows.
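A minimal batch sketch, assuming a hypothetical HDFS path and a created_at timestamp column in the crawled articles; the idea is simply to bucket records into one-hour windows and aggregate per bucket:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

conf = SparkConf().setAppName("articles-batch")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Hypothetical input: a periodic dump of the crawled articles.
articles = sqlContext.read.json("hdfs:///data/crawled_articles/*.json")

# Bucket records into 1-hour windows and count per window.
hourly = (articles
          .withColumn("hour", F.date_format("created_at", "yyyy-MM-dd HH:00"))
          .groupBy("hour")
          .count())
hourly.show()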

  • How to process the same data in different time windows?

If you use Spark Streaming, you can maintain several states (for example, using mapWithState), one per time window; see the sketch below.
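Note that mapWithState is only exposed in the Scala/Java API; in PySpark (which your code uses) the closest equivalent is updateStateByKey. A minimal sketch that keeps a running count per article id; the state value could just as well be a dict holding one counter per window:

ssc.checkpoint("hdfs:///checkpoints/articles")  # stateful operations require checkpointing (hypothetical path)

def update_count(new_values, running_count):
    # new_values: the values that arrived for this key in the current batch
    return (running_count or 0) + len(new_values)

running_counts = article_join_stream.updateStateByKey(update_count)
running_counts.pprint()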

Another idea (simpler in code, more complicated in terms of ops) is to run several clusters, each with its own window, all reading from the same stream.

If you perform batch processing, you can run the same operation several times with different time windows, e.g. reduceByWindow with multiple window lengths; a sketch is below.
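A streaming sketch of that idea, using reduceByKeyAndWindow (the keyed counterpart of reduceByWindow) to count events per id over 1-, 3- and 6-hour windows on the same DStream; window and slide durations are in seconds and must be multiples of the 300-second batch interval in your code:

def add(a, b):
    return a + b

events = axes_join_stream.map(lambda kv: (kv[0], 1))
counts_1h = events.reduceByKeyAndWindow(add, None, 1 * 60 * 60, 5 * 60)
counts_3h = events.reduceByKeyAndWindow(add, None, 3 * 60 * 60, 5 * 60)
counts_6h = events.reduceByKeyAndWindow(add, None, 6 * 60 * 60, 5 * 60)
counts_1h.pprint()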


Source: https://habr.com/ru/post/1244078/

