I have an operation in Spark that needs to be performed for multiple columns in a DataFrame. There are generally two ways to specify such operations:

- hard-code them:

  handleBias("bar", df)
    .join(handleBias("baz", df), df.columns)
    .drop(columnsToDrop: _*).show
- dynamically generate them from a list of column names:

  var isFirst = true
  var res = df
  for (col <- columnsToDrop ++ columnsToCode) {
    if (isFirst) {
      res = handleBias(col, res)
      isFirst = false
    } else {
      res = handleBias(col, res)
    }
  }
  res.drop(columnsToDrop: _*).show
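Note that both branches of the `if` apply the same step, so the loop reduces to folding the column list over the initial DataFrame, i.e. `(columnsToDrop ++ columnsToCode).foldLeft(df)((acc, c) => handleBias(c, acc))`. A minimal plain-Scala mirror of that fold (the string chaining only stands in for the real DataFrame calls, and the column names are hypothetical) shows the left-to-right nesting that the dynamic construction produces:

```scala
// Each iteration wraps the accumulated result in another handleBias call,
// so n columns yield n nested applications — one long, sequential chain.
val cols = List("bar", "baz")
val chained = cols.foldLeft("df")((acc, c) => s"handleBias($c, $acc)")
// chained == "handleBias(baz, handleBias(bar, df))"
```

The hard-coded version instead joins independently computed branches, which is why its plan looks so different.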
The problem is that the dynamically generated DAG is different, and the runtime of the dynamic solution grows much faster with the number of columns than it does for the hard-coded operations. I am curious how to combine the elegance of the dynamic construction with fast execution times.
Here is a comparison of the DAGs for the example code. For around 80 columns, the hard-coded version yields a reasonably compact graph, while the dynamically constructed query produces a far larger and much slower DAG.

A current version of Spark (2.0.2) was used with DataFrames and spark-sql. The code for the minimal example:
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{count, mean}

  def handleBias(col: String, df: DataFrame, target: String = "FOO"): DataFrame = {
    // share of positive-target rows per value of `col`
    val pre1_1 = df
      .filter(df(target) === 1)
      .groupBy(col, target)
      .agg((count("*") / df.filter(df(target) === 1).count).alias("pre_" + col))
      .drop(target)
    // mean of the target per value of `col`
    val pre2_1 = df
      .groupBy(col)
      .agg(mean(target).alias("pre2_" + col))
    df
      .join(pre1_1, Seq(col), "left")
      .join(pre2_1, Seq(col), "left")
      .na.fill(0)
  }
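To make the intent of handleBias concrete: for each value of `col` it joins in two statistics, the share of positive-target rows and the mean of the target. A plain-Scala sketch of those two aggregates on made-up data (the values are hypothetical, chosen only for illustration):

```scala
// Rows of (colValue, FOO). handleBias adds, per distinct colValue:
//   pre_col  = count of FOO==1 rows with that value / total FOO==1 rows
//   pre2_col = mean of FOO for that value
val rows = Seq(("a", 1), ("a", 0), ("b", 1))
val totalPos = rows.count(_._2 == 1) // 2 positive rows in total
val pre = rows.filter(_._2 == 1).groupBy(_._1)
  .map { case (k, v) => k -> v.size.toDouble / totalPos }
// pre  == Map("a" -> 0.5, "b" -> 0.5)
val pre2 = rows.groupBy(_._1)
  .map { case (k, v) => k -> v.map(_._2).sum.toDouble / v.size }
// pre2 == Map("a" -> 0.5, "b" -> 1.0)
```

In the Spark version these two maps correspond to the `pre1_1` and `pre2_1` DataFrames that are left-joined back onto `df`.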
Building the pipeline with foldLeft instead of the for loop produces the same kind of DAG.

So the dynamically generated DAG is much more complex than the hard-coded one, and execution slows down accordingly. Would expressing the same logic in plain SQL avoid this, or is there another way to generate these operations dynamically without blowing up the DAG?