You cannot add an arbitrary column to a DataFrame in Spark. New columns can only be created by using literals (other literal types are described in How to add a constant column to a Spark DataFrame?):
```python
from pyspark.sql.functions import lit

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

df_with_x4 = df.withColumn("x4", lit(0))
df_with_x4.show()
```
transforming an existing column:
```python
from pyspark.sql.functions import exp

df_with_x5 = df_with_x4.withColumn("x5", exp("x3"))
df_with_x5.show()
```
included using a join:
```python
from pyspark.sql.functions import col

lookup = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ("k", "v"))

# Left outer join on the lookup key, then drop the key and rename the value
df_with_x6 = (df_with_x5
    .join(lookup, col("x1") == col("k"), "leftouter")
    .drop("k")
    .withColumnRenamed("v", "x6"))
```
or generated with a function / udf:
```python
from pyspark.sql.functions import rand

df_with_x7 = df_with_x6.withColumn("x7", rand())
df_with_x7.show()
```
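A minimal sketch of the udf variant mentioned above; the wrapped function, the x8 column name, and df_with_x8 are illustrative, not from the original answer:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Hypothetical Python function wrapped as a UDF (column name x8 is illustrative)
double_x3 = udf(lambda v: v * 2.0, DoubleType())

df_with_x8 = df_with_x7.withColumn("x8", double_x3(df_with_x7["x3"]))
df_with_x8.show()
```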
Performance-wise, built-in functions (pyspark.sql.functions), which map to Catalyst expressions, are usually preferred over Python user-defined functions.
If you want to add the contents of an arbitrary RDD as a column, you can:

- add row numbers to the existing data frame
- call zipWithIndex on the RDD and convert it to a data frame
- join both using the index as a join key (sketched below)
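A minimal sketch of that recipe, assuming a hypothetical new_values RDD whose order matches the rows of df; the x9 column name is also illustrative:

```python
from pyspark.sql import Row

# Hypothetical RDD whose element order matches the rows of df
new_values = sc.parallelize(["u", "v"])

# Add row numbers to both sides via zipWithIndex, then convert to data frames
df_indexed = (df.rdd.zipWithIndex()
    .map(lambda row_i: Row(i=row_i[1], **row_i[0].asDict()))
    .toDF())
values_indexed = (new_values.zipWithIndex()
    .map(lambda v_i: Row(i=v_i[1], x9=v_i[0]))
    .toDF())

# Join on the index and drop it
df_with_x9 = df_indexed.join(values_indexed, "i").drop("i")
df_with_x9.show()
```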
zero323 Nov 12 '15 at 23:37