Below is an example of the code that I am running. When this Spark job runs, the DataFrame joins are performed with a SortMergeJoin instead of a BroadcastJoin.
    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.broadcast.Broadcast

    def joinedDf(sqlContext: SQLContext,
                 txnTable: DataFrame,
                 countriesDfBroadcast: Broadcast[DataFrame]): DataFrame = {
      import sqlContext.implicits._
      // Join the large transaction table with the small countries table,
      // which was wrapped in an sc.broadcast beforehand.
      txnTable.as("df1")
        .join(
          countriesDfBroadcast.value
            .withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID")
            .as("countries"),
          $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID",
          "inner")
    }

    joinedDf(sqlContext, txnTable, countriesDfBroadcast).write.parquet("temp")
The broadcast does not happen even when I specify the broadcast() hint in the join statement.
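For reference, this is the form of hint I mean, a minimal sketch using org.apache.spark.sql.functions.broadcast (countriesDf here stands for the plain, un-broadcast DataFrame, a name I am assuming for illustration):

    import org.apache.spark.sql.functions.broadcast

    // Marking the small side with broadcast() should make the planner
    // pick BroadcastHashJoin over SortMergeJoin.
    val hinted = txnTable.as("df1").join(
      broadcast(countriesDf.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries")),
      $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID",
      "inner")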
Instead, the optimizer hash-partitions the DataFrames, and this causes data skew.
Has anyone seen this behavior?
I am running this on YARN, using Spark 1.6 with HiveContext as the SQLContext. The Spark job runs with 200 executors. The txnTable data size is 240 GB, and the countriesDf data size is 5 MB.
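At 5 MB, the small table should also be under Spark's default automatic broadcast threshold (spark.sql.autoBroadcastJoinThreshold, 10 MB by default). A quick way to confirm the effective setting, assuming sqlContext is the HiveContext above:

    // Default is 10485760 (10 MB); countriesDf at 5 MB should qualify.
    println(sqlContext.getConf("spark.sql.autoBroadcastJoinThreshold"))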