Multiple insertion into a table using Apache Spark

I am working on a project and I am stuck on the following scenario.

I have a table: superMerge (id, name, salary)

and I have two more tables: table1 and table2

All three tables (table1, table2 and superMerge) have the same structure.

Now my task is to insert / update the superMerge table from table1 and table2. table1 is updated every 10 minutes and table2 every 20 minutes, so at time t = 20 minutes I have two jobs trying to update the same table (superMerge in this case).

I want to understand how I can perform this parallel insert / update / merge into the superMerge table using Spark or any other Hadoop application.


The two jobs know nothing about each other, so at t = 20 minutes they will simply collide on the target table. There are two common ways to avoid the "collision":

  • partition the target (HDFS) location so that each job reads/writes only its own partitions and the writers never touch the same files (see the sketch after this list),

  • serialize the jobs, so that only one of them inserts/updates at a time. This is simpler, but whichever job runs second is delayed while it waits.
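A minimal sketch of the first option, assuming a Parquet table laid out under /warehouse/super_merge and partitioned by a "source" column (the paths and the column name are illustrative assumptions, not from this thread):

import org.apache.spark.sql.SparkSession

// Each job owns exactly one partition directory of the shared table, so the
// two writers can run at the same time without ever touching the same files.
val spark = SparkSession.builder()
  .appName("job1")
  .enableHiveSupport()
  .getOrCreate()

// job1 rewrites only .../source=table1; job2 does the same with source=table2.
spark.table("table1")
  .write
  .mode("overwrite")
  .parquet("/warehouse/super_merge/source=table1")

Readers then see the table as the union of its partitions; with an external Hive table you would run MSCK REPAIR TABLE (or ALTER TABLE ... ADD PARTITION) once per new partition to make it visible.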


Can job1 and job2 be combined into a single job? Is that an option?

No: job1 runs every 10 minutes, and at the 20-minute mark (when job2 kicks in) job1 kicks in as well. I can't combine them, because the two sources are refreshed on different schedules and come from different systems. The data from table1 has to land within 10 minutes.
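Since the schedules do coincide at t = 20 minutes, the serialization option above needs a real cross-job lock. A crude sketch using an HDFS lock file (the lock path and the in-scope spark session are assumptions; FileSystem.create with overwrite = false fails if the file already exists):

import org.apache.hadoop.fs.{FileSystem, Path}

// Whichever job creates the lock file first becomes the single writer;
// the other job polls until the file disappears.
val fs   = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val lock = new Path("/locks/super_merge.lock")   // hypothetical lock location

def tryLock(): Boolean =
  try { fs.create(lock, false).close(); true }   // fails if the file exists
  catch { case _: java.io.IOException => false }

while (!tryLock()) Thread.sleep(5000)            // wait for the other job
try {
  // ... insert/update superMerge here; only one writer holds the lock ...
} finally {
  fs.delete(lock, false)                         // release the lock
}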


How is the data processed? Batch (Spark) or streaming (Spark Streaming)?

I would restructure it like this:

  • Run one job every 10 minutes instead of two competing ones. On each run it picks up whatever is fresh in table1 (and, on every second run, in table2) and merges it into superMerge. (A sketch follows this list.)
  • That way SuperMerge only ever has a single writer, so there is nothing to coordinate.
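A rough sketch of that single-writer merge. The question's schema is only (id, name, salary), so the updated_at column used below to pick the newest row per id is an assumption:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One job, one writer: union the current state with both deltas and keep
// only the most recent row for each id.
val merged = spark.table("superMerge")
  .union(spark.table("table1"))
  .union(spark.table("table2"))

val latest = merged
  .withColumn("rn", row_number().over(
    Window.partitionBy("id").orderBy(col("updated_at").desc)))
  .filter(col("rn") === 1)
  .drop("rn")

// Stage first, then swap directories/partitions: Spark cannot safely
// overwrite a table while it is reading from it in the same job.
latest.write.mode("overwrite").parquet("/staging/super_merge_new")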

You can read tb1 into DF1 and write it to location1, and tb2 into DF2 and write it to location2; since each job writes to its own location and then loads its own files, the two runs never conflict with each other.

Writing out job1 and job2:

df1.write.mode("overwrite").partitionBy("partition").parquet(location1)

df2.write.mode("overwrite").partitionBy("partition").parquet(location2)

Then, in each job, add the partition and load the files into the merged table. HiveContext.sql() executes a single statement per call, so the ALTER and the LOAD are issued separately:

hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location1/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")

hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location2/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")

Since each job stages its output in its own HDFS location, the two jobs can run in parallel and LOAD DATA simply moves each job's files into the table.
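Two caveats worth noting: LOAD DATA INPATH moves the files from location1/location2 into the table's warehouse directory rather than copying them, so those paths should be staging directories that each run rewrites. And in a real job the partition value would be computed rather than the literal x; an illustrative parameterized version (part is a hypothetical value):

val part = "x"   // placeholder, like the literal x above
hiveContext.sql(
  s"ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition='$part')")
hiveContext.sql(
  s"LOAD DATA INPATH 'location1/partition=$part/' INTO TABLE super_merge_table " +
  s"PARTITION (partition='$part')")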


Source: https://habr.com/ru/post/1688227/

