How to combine two DataFrame columns in Spark into a single tuple column?

I have a Spark DataFrame df with five columns. I want to add a sixth column whose values are a tuple of the first and second columns. When I use the withColumn() method, I get a type mismatch error, because the input is not of type Column but (Column, Column). Is there a solution here other than iterating with a for loop?

 var dfCol = (col1: Column, col2: Column) => (col1, col2)
 val vv = df.withColumn(
   "NewColumn",
   dfCol(df(df.schema.fieldNames(1)), df(df.schema.fieldNames(2)))
 )
+8
4 answers

You can use a user-defined function (udf) to achieve what you want.

UDF Definition

 object TupleUDFs {
   import org.apache.spark.sql.functions.udf
   // a type tag is required, as we have a generic udf
   import scala.reflect.runtime.universe.{TypeTag, typeTag}

   def toTuple2[S: TypeTag, T: TypeTag] =
     udf[(S, T), S, T]((x: S, y: T) => (x, y))
 }

Using

 df.withColumn(
   "tuple_col",
   TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
 )

Assuming that "a" and "b" are Int columns that you want to put in the tuple.
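A tuple produced this way is stored as a struct with fields _1 and _2, so the parts can be read back with dot syntax. A minimal sketch, assuming a local SparkSession and the TupleUDFs object defined above (the column names and toy data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession; adjust master/appName to your environment.
val spark = SparkSession.builder().master("local[*]").appName("tuple-udf-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 10), (2, 20)).toDF("a", "b")
val withTuple = df.withColumn("tuple_col", TupleUDFs.toTuple2[Int, Int].apply($"a", $"b"))

// The tuple column is a struct; its parts are addressed as tuple_col._1 and tuple_col._2.
withTuple.select($"tuple_col._1", $"tuple_col._2").show()
```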

+11

You can use the struct function, which creates a tuple (struct) from the provided columns:

 import org.apache.spark.sql.functions.struct

 val df = Seq((1, 2), (3, 4), (5, 3)).toDF("a", "b")
 df.withColumn("NewColumn", struct(df("a"), df("b"))).show(false)

 +---+---+---------+
 |a  |b  |NewColumn|
 +---+---+---------+
 |1  |2  |[1,2]    |
 |3  |4  |[3,4]    |
 |5  |3  |[5,3]    |
 +---+---+---------+
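Unlike the UDF approach, struct keeps the original column names as the field names of the new column, so the parts can be selected back directly. A hedged sketch, assuming a SparkSession with implicits imported and the same toy DataFrame as in this answer:

```scala
import org.apache.spark.sql.functions.struct
import spark.implicits._ // assumes an existing SparkSession named spark

val df = Seq((1, 2), (3, 4), (5, 3)).toDF("a", "b")
val withStruct = df.withColumn("NewColumn", struct(df("a"), df("b")))

// struct fields keep the source column names, so they are addressed as NewColumn.a / NewColumn.b.
withStruct.select($"NewColumn.a", $"NewColumn.b").show()
```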
+17

You can combine multiple DataFrame columns into a single array column.

 // $"*" will capture all existing columns
 df.select($"*", array($"col1", $"col2").as("newCol"))
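Note that array requires the combined columns to share a common element type, and the elements are then accessed by position rather than by name. A minimal sketch under assumed column names `a` and `b` and an existing SparkSession:

```scala
import org.apache.spark.sql.functions.array
import spark.implicits._ // assumes an existing SparkSession named spark

val df = Seq((1, 2), (3, 4)).toDF("a", "b")
// array() needs columns of a common type; $"*" keeps all existing columns alongside the new one.
val withArr = df.select($"*", array($"a", $"b").as("newCol"))

// Elements are read back by index with getItem.
withArr.select($"newCol".getItem(0).as("first"), $"newCol".getItem(1).as("second")).show()
```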
+2

If you simply want to combine two data columns into one column:

 import org.apache.spark.sql.functions.array

 df.withColumn("NewColumn", array("columnA", "columnB"))
0

Source: https://habr.com/ru/post/1232358/
