To apply a function to a column in Spark, the general (and maybe the only?) way seems to be:

df.withColumn(colName, myUdf(df.col(colName)))
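For context, here is a minimal sketch of that pattern; the DataFrame, column name and UDF are all hypothetical, just to show the shape of the call (assuming an existing sqlContext):

import org.apache.spark.sql.functions.udf

// hypothetical UDF: double a numeric value
val myUdf = udf((x: Long) => x * 2L)

// hypothetical DataFrame with a dot-free column name, just to show the pattern
val sample = sqlContext.range(0, 3).withColumnRenamed("id", "hourOfDay")
val colName = "hourOfDay"
val result = sample.withColumn(colName, myUdf(sample.col(colName)))
result.show()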
Great, but I have columns with dots in their names, and to access such a column I have to escape the name with backticks "`".
The problem is this: using the escaped name makes .withColumn create a new column instead of replacing the existing one, while the plain name cannot be resolved at all:
df.printSchema
root
 |-- raw.hourOfDay: long (nullable = false)
 |-- raw.minOfDay: long (nullable = false)
 |-- raw.dayOfWeek: long (nullable = false)
 |-- raw.sensor2: long (nullable = false)

df = df.withColumn("raw.hourOfDay", df.col("raw.hourOfDay"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "raw.hourOfDay" among (raw.hourOfDay, raw.minOfDay, raw.dayOfWeek, raw.sensor2);
With the escaped name it works:
df = df.withColumn("`raw.hourOfDay`", df.col("`raw.hourOfDay`"))
df: org.apache.spark.sql.DataFrame = [raw.hourOfDay: bigint, raw.minOfDay: bigint, raw.dayOfWeek: bigint, raw.sensor2: bigint, `raw.hourOfDay`: bigint]

scala> df.printSchema
root
 |-- raw.hourOfDay: long (nullable = false)
 |-- raw.minOfDay: long (nullable = false)
 |-- raw.dayOfWeek: long (nullable = false)
 |-- raw.sensor2: long (nullable = false)
 |-- `raw.hourOfDay`: long (nullable = false)
but, as you can see, the schema now contains an extra column whose name literally includes the backticks.
If I do the above and then drop the old column (via its escaped name), the old column is indeed removed, but after that any attempt to access the new column leads to something like:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "`raw.sensor2`" among (`raw.hourOfDay`, `raw.minOfDay`, `raw.dayOfWeek`, `raw.sensor2`);

as if Spark now treats the backticks as part of the column name rather than as escape characters.
So, how do I use withColumn to replace the contents of the old column without changing its name?
(PS: note that my column names are parametric, so I loop over the names; I used explicit column names here only for illustration. The escape really is "`" + colName + "`".)
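In code, that escaping amounts to something like this small helper (the name quoted is hypothetical):

// hypothetical helper: wrap a name in backticks only when it contains a dot
def quoted(colName: String): String =
  if (colName.contains(".")) "`" + colName + "`" else colName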
EDIT:
For now, the only workaround I have found is:
for (t <- df.columns) {
  if (t.contains(".")) {
    df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
    df = df.drop(df.col("`" + t + "`"))
    df = df.withColumnRenamed("`" + t + "`", t)
  } else {
    df = df.withColumn(t, myUdf(df.col(t)))
  }
}
Not very elegant, I guess...
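To keep the loop readable when there are many columns, the same trick can be wrapped in a helper; this is just a sketch with the exact same add / drop / rename steps as above, assuming myUdf is in scope:

import org.apache.spark.sql.DataFrame

// same workaround as the loop above, wrapped in a function
def applyUdfTo(in: DataFrame, colName: String): DataFrame = {
  if (colName.contains(".")) {
    val escaped = "`" + colName + "`"
    val added   = in.withColumn(escaped, myUdf(in.col(escaped)))  // new column with backticks in its name
    val dropped = added.drop(added.col(escaped))                  // drops the original dotted column
    dropped.withColumnRenamed(escaped, colName)                   // rename the copy back to the original name
  } else {
    in.withColumn(colName, myUdf(in.col(colName)))
  }
}

df = df.columns.foldLeft(df)(applyUdfTo)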
EDIT:

The documentation states:
def withColumn(colName: String, col: Column): DataFrame
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
So replacing a column should not be a problem here. However, as pointed out by @Glennie below, using the new name works fine, so this could be a bug in Spark 1.6.
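A quick way to see the documented replace-in-place behaviour with a dot-free name (reusing the hypothetical sample and myUdf from the first sketch above):

// "hourOfDay" has no dot, so withColumn replaces it instead of appending a copy
val replaced = sample.withColumn("hourOfDay", myUdf(sample.col("hourOfDay")))
replaced.printSchema   // still exactly one "hourOfDay" column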