How to add multiple columns using UDF?

Question

I want to add the return values of a UDF to an existing DataFrame as separate columns. How can I do this efficiently?

Here is an example of what I have so far.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

    df = spark.createDataFrame([("Alive", 4)], ["Name", "Number"])
    df.show(1)

    +-----+------+
    | Name|Number|
    +-----+------+
    |Alive|     4|
    +-----+------+

    def example(n):
        return [[n+2], [n-2]]

    # schema = StructType([
    #     StructField("Out1", ArrayType(IntegerType()), False),
    #     StructField("Out2", ArrayType(IntegerType()), False)])

    example_udf = udf(example)

Now I can add a column to the DataFrame as follows:

    newDF = df.withColumn("Output", example_udf(df["Number"]))
    newDF.show(1)

    +-----+------+----------+
    | Name|Number|    Output|
    +-----+------+----------+
    |Alive|     4|[[6], [2]]|
    +-----+------+----------+

However, I want the two values in separate columns, not in the same one.

Ideally, I would like to split the output column to avoid calling the example function twice (once for each return value), as described here and here. However, in my situation I get an array of arrays, and I don't see how the split would work (note that each array will contain several values separated by a ",").

Ultimately, I want this:

    +-----+------+----+----+
    | Name|Number|Out1|Out2|
    +-----+------+----+----+
    |Alive|     4|   6|   2|
    +-----+------+----+----+

Note that using the StructType return type is optional and does not have to be part of the solution.

EDIT: I commented out the use of StructType (and edited the udf assignment), since it is not necessary for the return type of the example function. However, it has to be used if the return value is something like

    return [6,3,2],[4,3,1]
1 answer

To return a StructType, just use Row:

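    from pyspark.sql import Row
    import pyspark.sql.functions as f
    from pyspark.sql.types import StructType, StructField, IntegerType

    df = spark.createDataFrame([("Alive", 4)], ["Name", "Number"])

    def example(n):
        # Return a Row whose field names match the schema below
        return Row('Out1', 'Out2')(n + 2, n - 2)

    schema = StructType([
        StructField("Out1", IntegerType(), False),
        StructField("Out2", IntegerType(), False)])

    # Register the function as a UDF that returns a struct
    example_udf = f.udf(example, schema)

    newDF = df.withColumn("Output", example_udf(df["Number"]))
    # "Output.*" expands the struct fields into top-level columns
    newDF = newDF.select("Name", "Number", "Output.*")
    newDF.show(truncate=False)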
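The same approach should extend to the case mentioned in the question's EDIT, where each return value is itself a list. The following is only a sketch under a few assumptions: the helper name `add_array_columns` and the concrete list values are hypothetical, the input DataFrame is assumed to have the `Name` and `Number` columns from the question, and the struct fields are declared as `ArrayType(IntegerType())`, as in the commented-out schema.

```python
def example(n):
    """Return two lists, mirroring the `return [6,3,2],[4,3,1]` case."""
    # Illustrative values only; the real function would compute its own lists.
    return [n + 2, n + 1, n], [n - 2, n - 1, n]


def add_array_columns(df):
    """Add array-valued Out1 and Out2 columns to df (hypothetical helper)."""
    # Imported here so that `example` can be tried out without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    # Each struct field holds one of the two returned lists.
    schema = StructType([
        StructField("Out1", ArrayType(IntegerType()), False),
        StructField("Out2", ArrayType(IntegerType()), False),
    ])
    example_udf = udf(example, schema)

    # The UDF returns a tuple, which Spark maps onto the struct fields in order;
    # "Output.*" then expands the struct into top-level columns.
    return (
        df.withColumn("Output", example_udf(df["Number"]))
          .select("Name", "Number", "Output.*")
    )
```

With the `df` from the answer above, `add_array_columns(df).show(truncate=False)` would display the result. Returning a plain tuple instead of a `Row` is a stylistic choice here; either one is matched against the declared schema by position.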

Source: https://habr.com/ru/post/1273879/

