How to save / insert each DStream into a persistent table

I am having a problem with Spark Streaming: saving the output stream of a DStream into a persistent SQL table. I would like to insert every DStream result (coming from a single batch that Spark processes) into a unique table. I am using Python with Spark 1.6.2.

At this point in my code, I have a DStream, made of one or more RDDs, that I would like to continuously insert/store into the SQL table without losing the result of any processed batch.

# join the two DStreams by key and keep one pair per record, e.g. (4.0, 0)
rr = feature_and_label.join(result_zipped)\
                      .map(lambda x: (x[1][0][0], x[1][1]))

Each DStream result is represented here as a tuple, for example (4.0, 0). I can't use SparkSQL because of the way Spark treats the table, that is, as a temporary table, so the result is lost at every batch.
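For illustration, the temporary-table behaviour I mean looks roughly like this (a sketch, not my actual code; sqlContext, the Row field names and the "results" table name are only illustrative):

from pyspark.sql import Row, SQLContext

sqlContext = SQLContext(sc)  # sc is the running SparkContext

def register_batch(time, rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(
        rdd.map(lambda p: Row(label=float(p[0]), prediction=int(p[1]))))
    # re-registering the same temp table on every batch replaces its
    # contents, so the results of earlier batches are lost
    df.registerTempTable("results")

rr.foreachRDD(register_batch)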

This is an example output:


Time: 2016-09-23 00:57:00
(0.0, 2)

Time: 2016-09-23 00:57:01
(4.0, 0)

Time: 2016-09-23 00:57:02
(4.0, 0)

...

As shown above, each batch contains exactly one DStream result. As I said, I would like to permanently store these results in a table saved somewhere, so that I can query them at a later time. So my question is: is there a way to do this?
I'd appreciate it if somebody could help me out with it, and especially tell me whether or not it is possible. Thank you.


Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea since Spark 2.0). One way to store the results in a permanent table and query them later is to use one of the various databases in the Spark Database Ecosystem. There are pros and cons to each, and your use case matters. The options can be segmented by type of data management, form of data, and connection to Spark; for example:

  • Database, SQL, Integrated
  • Database, SQL, Connector
  • Database, NoSQL, Connector
  • Filesystem, Files, Integrated (e.g. HDFS)
  • Datawarehouse, SQL, Connector
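Whichever store you pick, the usual pattern is to go through foreachRDD and append every micro-batch to the external table. Below is a minimal sketch for Spark 1.6 in Python, assuming a JDBC-accessible SQL database; the URL, table name and credentials are placeholders, and rr is the DStream of pairs from the question:

from pyspark.sql import Row, SQLContext

sqlContext = SQLContext(sc)  # sc is the running SparkContext

def save_batch(time, rdd):
    # skip empty batches: createDataFrame cannot infer a schema from them
    if rdd.isEmpty():
        return
    rows = rdd.map(lambda p: Row(label=float(p[0]), prediction=int(p[1])))
    df = sqlContext.createDataFrame(rows)
    # mode="append" adds this batch to the table instead of replacing it;
    # the JDBC driver jar must be on the Spark classpath
    df.write.jdbc(url="jdbc:postgresql://localhost/mydb",  # placeholder
                  table="stream_results",                  # placeholder
                  mode="append",
                  properties={"user": "user", "password": "password"})

rr.foreachRDD(save_batch)

If plain files on HDFS are enough, rr.saveAsTextFiles("hdfs:///path/results") gives the same per-batch persistence without an external database.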


Source: https://habr.com/ru/post/1655550/

