Spark: applying a function to columns in parallel

Spark will process the data in parallel, but not the operations. In my DAG I want to call a function per column; the values for each column can be calculated independently of the other columns. Is there any way to achieve such parallelism via the Spark SQL API? Utilizing window functions (see "Spark dynamic DAG is a lot slower and different from hard coded DAG") helped to optimize the DAG a lot, but it only executes in a serial fashion.

An example that contains a bit more information can be found at https://github.com/geoHeil/sparkContrastCoding

A minimal example is below:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession in scope as `spark`

val df = Seq(
    (0, "A", "B", "C", "D"),
    (1, "A", "B", "C", "D"),
    (0, "d", "a", "jkl", "d"),
    (0, "d", "g", "C", "D"),
    (1, "A", "d", "t", "k"),
    (1, "d", "c", "C", "D"),
    (1, "c", "B", "C", "D")
  ).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")

val inputToDrop = Seq("col3TooMany")
val inputToBias = Seq("col1", "col2")

val targetCounts = df.filter(df("TARGET") === 1).groupBy("TARGET").agg(count("TARGET").as("cnt_foo_eq_1"))
val newDF = df.toDF.join(broadcast(targetCounts), Seq("TARGET"), "left")
newDF.cache
def handleBias(df: DataFrame, colName: String, target: String = "TARGET") = {
    val w1 = Window.partitionBy(colName)
    val w2 = Window.partitionBy(colName, target)

    df.withColumn("cnt_group", count("*").over(w2))
      .withColumn("pre2_" + colName, mean(target).over(w1))
      .withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
      .drop("cnt_group")
  }
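
For illustration, applying handleBias to a single column of the frame above adds pre2_col1 (the mean of TARGET per level of col1) and pre_col1 (the minimum of cnt_group / cnt_foo_eq_1 within each level, coalesced to 0):

val biased = handleBias(newDF, "col1")
biased.select("col1", "pre_col1", "pre2_col1").distinct.show()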

val joinUDF = udf((newColumn: String, newValue: String, codingVariant: Int, results: Map[String, Map[String, Seq[Double]]]) => {
    results.get(newColumn) match {
      case Some(tt) => {
        val nestedArray = tt.getOrElse(newValue, Seq(0.0))
        if (codingVariant == 0) {
          nestedArray.head
        } else {
          nestedArray.last
        }
      }
      case None => throw new Exception("Column not contained in initial data frame")
    }
  })

handleBias is then applied to every column, one after another:

val res = (inputToDrop ++ inputToBias).toSet.foldLeft(newDF) {
    (currentDF, colName) =>
      {
        logger.info("using col " + colName)
        handleBias(currentDF, colName)
      }
  }
    .drop("cnt_foo_eq_1")

val combined = ((inputToDrop ++ inputToBias).toSet).foldLeft(res) {
    (currentDF, colName) =>
      {
        currentDF
          .withColumn("combined_" + colName, map(col(colName), array(col("pre_" + colName), col("pre2_" + colName))))
      }
  }

val columnsToUse = combined
    .select(combined.columns
      .filter(_.startsWith("combined_"))
      map (combined(_)): _*)

val newNames = columnsToUse.columns.map(_.split("combined_").last)
val renamed = columnsToUse.toDF(newNames: _*)

val cols = renamed.columns
val localData = renamed.collect

val columnsMap = cols.map { colName =>
    colName -> localData.flatMap(_.getAs[Map[String, Seq[Double]]](colName)).toMap
}.toMap
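
To apply the collected statistics back to a frame, joinUDF can be given columnsMap as a map literal. A minimal sketch, assuming Spark 2.2+ for typedLit (the output column name is only illustrative):

import org.apache.spark.sql.functions.typedLit

// look up the first statistic (codingVariant = 0) for each value of col1
val applied = df.withColumn(
  "pre_col1",
  joinUDF(lit("col1"), col("col1"), lit(0), typedLit(columnsMap)))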

In short, this cannot be parallelized as written. DataFrames are lazily evaluated, and every transformation produces a new DataFrame that depends on its inputs.

handleBias transforms DataFrames: it takes one DataFrame and returns a new one. Since each step of the fold consumes the output of the previous step, the chain of calls is inherently sequential.

Instead of folding, you could (in pseudocode):

  • add a unique id:

    df_with_id = df.withColumn("id", unique_id())
    
  • compute each df independently:

    dfs = for (c in columns) 
      yield handle_bias(df, c).withColumn(
        "pres", explode([(pre_name, pre_value), (pre2_name, pre2_value)])
      )
    
  • union all the partial results:

    combined = dfs.reduce(union)
    
  • and pivot to get back from long to wide format, as sketched after this list:

    combined.groupBy("id").pivot("pres._1").agg(first("pres._2"))
    

That said, I doubt it is worth the fuss: the process you use is extremely heavy as it is and requires significant network and disk I/O.

If the total number of levels (sum(count(distinct x)) for x in columns) is relatively low, you can compute all the statistics in a single pass over the data, using for example aggregateByKey with a Map[Tuple2[_, _], StatCounter]; otherwise consider downsampling to the point where the statistics can be computed locally.
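
A single-pass sketch of that idea, keyed by (column, level) pairs rather than aggregating one big map, and assuming the string columns and integer TARGET of the example frame:

import org.apache.spark.util.StatCounter

val statCols = (inputToDrop ++ inputToBias).toSeq

// one job over the data: accumulate a StatCounter per (column, level) pair
val stats = df.rdd
  .flatMap { row =>
    val target = row.getAs[Int]("TARGET").toDouble
    statCols.map(c => ((c, row.getAs[String](c)), target))
  }
  .aggregateByKey(new StatCounter())(
    (acc, v) => acc.merge(v), // fold one observation into the counter
    (a, b) => a.merge(b))     // combine per-partition counters
  .collectAsMap()

// stats(("col1", "A")).mean is the share of TARGET == 1 among rows with col1 == "A"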


Source: https://habr.com/ru/post/1665625/

