How to avoid shuffles when joining DataFrames on unique keys?

I have two DataFrames, A and B:

  • A has columns (id, info1, info2), with approximately 200 million rows
  • B has only the column id, with 1 million rows

The column id is unique in both DataFrames.

I want a new DataFrame that filters A down to the ids that also appear in B.

If B were very small, I know I could do something along the lines of

A.filter($"id".isin(bIds: _*)) // with B's id values collected as bIds
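Spelled out (a sketch only; it assumes id is a Long and that B's ids are first collected to the driver):

import spark.implicits._ // the usual SparkSession named spark

// Collect B's ids to the driver, then filter A locally on each executor.
val bIds: Array[Long] = B.select("id").as[Long].collect()
val filteredA = A.filter($"id".isin(bIds: _*))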

But B is still quite large, so it might not all fit in a broadcast variable.

I also know I can use

A.join(B, Seq("id"))

but this does not exploit the uniqueness of id, and I am afraid it will cause unnecessary shuffles.

Is there a way to do this join without shuffling the whole of A?

+4

3 answers:

  • Use a broadcast join if you can. With a broadcast (map-side) join, Spark sends a full copy of the smaller table to every executor, so the rows of the large table never leave the machine they are already on (no shuffle of A).

    A table consisting of a single id column and a million rows is usually only a few megabytes, so it may well fit. You can raise spark.sql.autoBroadcastJoinThreshold so that Spark broadcasts B automatically, or use an explicit broadcast hint; see the sketch after this list.

  • Use the same partitioner. This is an RDD-level technique: if two key-value RDDs were partitioned with the same partitioner (and the same number of partitions) and persisted, records with a given key already sit in corresponding partitions, so a join between them does not need a full shuffle. You pay for one up-front repartitioning, which pays off when the partitioned data is reused, for example for several joins on the same key.

  • If the data is huge and/or your cluster cannot grow to the point where even (2) leads to OOM, use a two-pass approach: first re-partition the data and persist it as partitioned tables (dataframe.write.partitionBy()), then join the sub-partitions one by one in a loop, "appending" to the same final result table.
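To illustrate the first point: the 100 MB value below is only an example, and whether B is actually broadcast depends on its estimated size, but raising the threshold is often enough for a one-column, 1-million-row table.

// Raise the automatic broadcast threshold (in bytes); tables estimated below this size are broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
val result = A.join(B, Seq("id")) // should now be planned as a broadcast join
result.explain()                  // verify: no shuffle (Exchange) on A's side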

+4

You have two options for joining DataFrame A: the default "Join and Shuffle", or a join that reuses a known partitioner.

Without a known partitioner:

A.join(B, Seq("id"))

By default Spark knows nothing about how either dataset is partitioned, so it hashes the join key on both sides, sends rows with the same key hash across the network to the same machine, and joins them there. That network transfer is the shuffle, and with 200 million rows in A it is the expensive part.

With a HashPartitioner: if you partitionBy() the dataset first and persist it, Spark knows that it is hash-partitioned, and a subsequent join() takes advantage of that information. When you then call A.join(B, Seq("id")), Spark shuffles only B. Since B is far smaller than A, moving B to A's partitions is much cheaper than shuffling A.

For example:

 import org.apache.spark.HashPartitioner
 // An RDD of (id, info) pairs, hash-partitioned once and cached so that
 // later joins on id reuse this partitioning instead of re-shuffling A.
 val aById = sc.sequenceFile[String, String]("hdfs://...")
     .partitionBy(new HashPartitioner(100)) // Create 100 partitions
     .persist()
 aById.join(bById) // bById: B keyed by id; only bById is shuffled
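The snippet above uses the RDD API. A rough DataFrame-level analogue, as a sketch only, is to hash-partition A by id once and cache it; whether the later join actually reuses this partitioning depends on the number of partitions and the Spark version, so check the plan:

import org.apache.spark.sql.functions.col

val aPartitioned = A.repartition(col("id")).persist()
aPartitioned.count() // materialize the cached, partitioned copy
val joined = aPartitioned.join(B, Seq("id"))
joined.explain()     // check whether an Exchange still appears on A's side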


+3

As noted above, if DataFrame B is small enough to be sent to every node, then the matching (i.e. checking each id of DataFrame A against B) happens locally on each node, without a shuffle (i.e. without moving A's rows across the network).

In code, that looks like this:

import org.apache.spark.sql.functions.broadcast

val joinExpr = A.col("id") === B.col("id")

val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")

Run filtered_A.explain() to confirm that the plan uses a broadcast join and does not shuffle A.

+1

Source: https://habr.com/ru/post/1676592/

