How to run two Spark jobs in parallel in standalone mode

I have a Spark job in which I process a file and then do the following steps:

1. Load the file into a DataFrame
2. Push the DataFrame to Elasticsearch
3. Run some aggregations on the DataFrame and save the result to Cassandra

I wrote a Spark job for this, in which I make the following function calls:

writeToES(df)
writeToCassandra(df)
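
For context, the two functions might look roughly like this, assuming the elasticsearch-hadoop and spark-cassandra-connector data sources are on the classpath; the index, keyspace, table, and the aggregation itself are placeholders, not part of the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Hypothetical sketch: write the DataFrame to Elasticsearch via the
// elasticsearch-hadoop data source ("myindex/mytype" is a placeholder).
void writeToES(Dataset<Row> df) {
    df.write()
      .format("org.elasticsearch.spark.sql")
      .option("es.resource", "myindex/mytype")
      .mode(SaveMode.Append)
      .save();
}

// Hypothetical sketch: aggregate, then write to Cassandra via the
// spark-cassandra-connector data source (keyspace/table are placeholders).
void writeToCassandra(Dataset<Row> df) {
    df.groupBy("key").count()   // stands in for "some aggregations"
      .write()
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "myks")
      .option("table", "mytable")
      .mode(SaveMode.Append)
      .save();
}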

Right now these two operations run one after the other, but they could run in parallel.

How can I do this within a single Spark application?

I could submit two separate Spark jobs, one writing to ES and one to Cassandra, but each job would use its own set of ports, which I want to avoid.

1 answer

Yes, you can run both writes in parallel within a single application, as long as they are submitted from separate threads.

The Spark documentation on scheduling within an application explains why this works:

Inside a given Spark application (a single SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By "job", the documentation means a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark's scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

So you can trigger the two actions from concurrent threads (in Java, for example, with the CompletableFuture API, but use whatever async mechanism fits your execution environment):

CompletableFuture.runAsync(() -> writeToES(df));
CompletableFuture.runAsync(() -> writeToCassandra(df));
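
One way to flesh this out is to keep references to the futures and block until both complete; the dedicated thread pool here is an illustrative choice (by default runAsync runs on the shared ForkJoinPool):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative: a small dedicated pool instead of the common ForkJoinPool.
ExecutorService pool = Executors.newFixedThreadPool(2);

CompletableFuture<Void> esWrite =
    CompletableFuture.runAsync(() -> writeToES(df), pool);
CompletableFuture<Void> cassandraWrite =
    CompletableFuture.runAsync(() -> writeToCassandra(df), pool);

// Block until both writes finish; otherwise the driver could stop the
// SparkContext while the jobs are still running.
CompletableFuture.allOf(esWrite, cassandraWrite).join();
pool.shutdown();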

Make sure you keep references to those futures and wait for them to complete, otherwise the driver may shut down while the writes are still running. Also note that by default Spark schedules concurrent jobs in FIFO order; if you want the two jobs to share cluster resources evenly, enable the FAIR scheduler:

conf.set("spark.scheduler.mode", "FAIR")
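
For completeness, here is a sketch of how that setting might be wired into the driver; the application name and pool name are placeholders, and setLocalProperty("spark.scheduler.pool", ...) is the documented way to route jobs submitted from the current thread into a named pool:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf()
    .setAppName("es-cassandra-writer")   // placeholder name
    .set("spark.scheduler.mode", "FAIR");

SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

// Optional: assign the jobs submitted from this thread to a named pool
// (pools are defined in a fairscheduler.xml file).
spark.sparkContext().setLocalProperty("spark.scheduler.pool", "writes");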

Source: https://habr.com/ru/post/1695747/

