How to combine two DataFrame columns in Spark into a single tuple column?

I have a Spark DataFrame df with five columns. I want to add a sixth column whose values are a tuple of the first and second columns. When I use the withColumn() method, I get a type mismatch error, because the input is not of type Column but (Column, Column). Is there a solution here other than iterating with a for loop?

 var dfCol = (col1: Column, col2: Column) => (col1, col2)
 val vv = df.withColumn(
   "NewColumn",
   dfCol(df(df.schema.fieldNames(1)), df(df.schema.fieldNames(2)))
 )
+8
4 answers

You can use a user-defined function (udf) to achieve what you want.

UDF Definition

 object TupleUDFs {
   import org.apache.spark.sql.functions.udf
   // a type tag is required, as we have a generic udf
   import scala.reflect.runtime.universe.{TypeTag, typeTag}

   def toTuple2[S: TypeTag, T: TypeTag] =
     udf[(S, T), S, T]((x: S, y: T) => (x, y))
 }

Using

 df.withColumn(
   "tuple_col",
   TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
 )

Assuming that "a" and "b" are Int columns that you want to put in the tuple.
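A tuple produced this way is stored as a struct with fields _1 and _2, so the parts can be read back with dot syntax. A minimal sketch, assuming a local SparkSession and the TupleUDFs object defined above (the column names and toy data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession; adjust master/appName to your environment.
val spark = SparkSession.builder().master("local[*]").appName("tuple-udf-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 10), (2, 20)).toDF("a", "b")
val withTuple = df.withColumn("tuple_col", TupleUDFs.toTuple2[Int, Int].apply($"a", $"b"))

// The tuple column is a struct; its parts are addressed as tuple_col._1 and tuple_col._2.
withTuple.select($"tuple_col._1", $"tuple_col._2").show()
```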

+11

You can use the struct function, which creates a tuple (struct) from the provided columns:

 import org.apache.spark.sql.functions.struct

 val df = Seq((1, 2), (3, 4), (5, 3)).toDF("a", "b")
 df.withColumn("NewColumn", struct(df("a"), df("b"))).show(false)

 +---+---+---------+
 |a  |b  |NewColumn|
 +---+---+---------+
 |1  |2  |[1,2]    |
 |3  |4  |[3,4]    |
 |5  |3  |[5,3]    |
 +---+---+---------+
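Unlike the UDF approach, struct keeps the original column names as the field names of the new column, so the parts can be selected back directly. A hedged sketch, assuming a SparkSession with implicits imported and the same toy DataFrame as in this answer:

```scala
import org.apache.spark.sql.functions.struct
import spark.implicits._ // assumes an existing SparkSession named spark

val df = Seq((1, 2), (3, 4), (5, 3)).toDF("a", "b")
val withStruct = df.withColumn("NewColumn", struct(df("a"), df("b")))

// struct fields keep the source column names, so they are addressed as NewColumn.a / NewColumn.b.
withStruct.select($"NewColumn.a", $"NewColumn.b").show()
```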
+17

You can combine multiple DataFrame columns into a single array column.

 // $"*" will capture all existing columns
 df.select($"*", array($"col1", $"col2").as("newCol"))
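Note that array requires the combined columns to share a common element type, and the elements are then accessed by position rather than by name. A minimal sketch under assumed column names `a` and `b` and an existing SparkSession:

```scala
import org.apache.spark.sql.functions.array
import spark.implicits._ // assumes an existing SparkSession named spark

val df = Seq((1, 2), (3, 4)).toDF("a", "b")
// array() needs columns of a common type; $"*" keeps all existing columns alongside the new one.
val withArr = df.select($"*", array($"a", $"b").as("newCol"))

// Elements are read back by index with getItem.
withArr.select($"newCol".getItem(0).as("first"), $"newCol".getItem(1).as("second")).show()
```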
+2

If you simply want to combine two data columns into one column:

 import org.apache.spark.sql.functions.array

 df.withColumn("NewColumn", array("columnA", "columnB"))
0

Source: https://habr.com/ru/post/1232358/
