Multiple insertion into a table using Apache Spark

I am working on a project and I am stuck on the following scenario.

I have a table: superMerge (id, name, salary)

and I have two more tables: table1 and table2

All three tables (table1, table2 and superMerge) have the same structure.

Now my task is to insert / update the superMerge table from table1 and table2. table1 is updated every 10 minutes and table2 every 20 minutes, so at time t = 20 minutes I have two jobs trying to update the same table (superMerge in this case).

I want to understand how I can perform this parallel insert / update / merge into the superMerge table using Spark or any other Hadoop application.


The two jobs know nothing about each other, so at t = 20 minutes they will simply collide on the target table. There are two common ways to avoid the "collision":

  • partition the target (HDFS) location so that each job reads/writes only its own partitions and the writers never touch the same files (see the sketch after this list),

  • serialize the jobs, so that only one of them inserts/updates at a time. This is simpler, but whichever job runs second is delayed while it waits.
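A minimal sketch of the first option, assuming a Parquet table laid out under /warehouse/super_merge and partitioned by a "source" column (the paths and the column name are illustrative assumptions, not from this thread):

import org.apache.spark.sql.SparkSession

// Each job owns exactly one partition directory of the shared table, so the
// two writers can run at the same time without ever touching the same files.
val spark = SparkSession.builder()
  .appName("job1")
  .enableHiveSupport()
  .getOrCreate()

// job1 rewrites only .../source=table1; job2 does the same with source=table2.
spark.table("table1")
  .write
  .mode("overwrite")
  .parquet("/warehouse/super_merge/source=table1")

Readers then see the table as the union of its partitions; with an external Hive table you would run MSCK REPAIR TABLE (or ALTER TABLE ... ADD PARTITION) once per new partition to make it visible.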


Can job1 and job2 be combined into a single job? Is that an option?

No: job1 runs every 10 minutes, and at the 20-minute mark (when job2 kicks in) job1 kicks in as well. I can't combine them, because the two sources are refreshed on different schedules and come from different systems. The data from table1 has to land within 10 minutes.
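Since the schedules do coincide at t = 20 minutes, the serialization option above needs a real cross-job lock. A crude sketch using an HDFS lock file (the lock path and the in-scope spark session are assumptions; FileSystem.create with overwrite = false fails if the file already exists):

import org.apache.hadoop.fs.{FileSystem, Path}

// Whichever job creates the lock file first becomes the single writer;
// the other job polls until the file disappears.
val fs   = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val lock = new Path("/locks/super_merge.lock")   // hypothetical lock location

def tryLock(): Boolean =
  try { fs.create(lock, false).close(); true }   // fails if the file exists
  catch { case _: java.io.IOException => false }

while (!tryLock()) Thread.sleep(5000)            // wait for the other job
try {
  // ... insert/update superMerge here; only one writer holds the lock ...
} finally {
  fs.delete(lock, false)                         // release the lock
}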


How is the data processed? Batch (Spark) or streaming (Spark Streaming)?

I would restructure it like this:

  • Run one job every 10 minutes instead of two competing ones. On each run it picks up whatever is fresh in table1 (and, on every second run, in table2) and merges it into superMerge. (A sketch follows this list.)
  • That way SuperMerge only ever has a single writer, so there is nothing to coordinate.
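A rough sketch of that single-writer merge. The question's schema is only (id, name, salary), so the updated_at column used below to pick the newest row per id is an assumption:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One job, one writer: union the current state with both deltas and keep
// only the most recent row for each id.
val merged = spark.table("superMerge")
  .union(spark.table("table1"))
  .union(spark.table("table2"))

val latest = merged
  .withColumn("rn", row_number().over(
    Window.partitionBy("id").orderBy(col("updated_at").desc)))
  .filter(col("rn") === 1)
  .drop("rn")

// Stage first, then swap directories/partitions: Spark cannot safely
// overwrite a table while it is reading from it in the same job.
latest.write.mode("overwrite").parquet("/staging/super_merge_new")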

You can read tb1 into DF1 and write it to location1, and tb2 into DF2 and write it to location2; since each job writes to its own location and then loads its own files, the two runs never conflict with each other.

Writing out job1 and job2:

df1.write.mode("overwrite").partitionBy("partition").parquet(location1)

df2.write.mode("overwrite").partitionBy("partition").parquet(location2)

Then, in each job, add the partition and load the files into the merged table. HiveContext.sql() executes a single statement per call, so the ALTER and the LOAD are issued separately:

hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location1/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")

hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location2/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")

Since each job stages its output in its own HDFS location, the two jobs can run in parallel and LOAD DATA simply moves each job's files into the table.
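Two caveats worth noting: LOAD DATA INPATH moves the files from location1/location2 into the table's warehouse directory rather than copying them, so those paths should be staging directories that each run rewrites. And in a real job the partition value would be computed rather than the literal x; an illustrative parameterized version (part is a hypothetical value):

val part = "x"   // placeholder, like the literal x above
hiveContext.sql(
  s"ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition='$part')")
hiveContext.sql(
  s"LOAD DATA INPATH 'location1/partition=$part/' INTO TABLE super_merge_table " +
  s"PARTITION (partition='$part')")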


Source: https://habr.com/ru/post/1688227/

