Spark 1.6: applying a function to a column with a dot in its name / how to reference colName correctly

To apply a function to a column in Spark, the general (only?) way seems to be:

    df.withColumn(colName, myUdf(df.col(colName)))
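For completeness, a minimal sketch of that pattern (the UDF body here is only a placeholder for my real function; df and colName are as elsewhere in this post):

    import org.apache.spark.sql.functions.udf

    // placeholder UDF standing in for the real transformation
    val myUdf = udf((x: Long) => x + 1)

    // for an ordinary (dot-free) column name this replaces the column in place
    df = df.withColumn(colName, myUdf(df.col(colName)))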

Great, but I have columns with dots in their names, and to access such a column I need to escape the name with backticks (`).

The problem is this: if I use the escaped name, withColumn creates a new column with the escaped name instead of replacing the original. First, the unescaped name simply does not resolve:

    df.printSchema
    root
     |-- raw.hourOfDay: long (nullable = false)
     |-- raw.minOfDay: long (nullable = false)
     |-- raw.dayOfWeek: long (nullable = false)
     |-- raw.sensor2: long (nullable = false)

    df = df.withColumn("raw.hourOfDay", df.col("raw.hourOfDay"))
    org.apache.spark.sql.AnalysisException: Cannot resolve column name "raw.hourOfDay" among (raw.hourOfDay, raw.minOfDay, raw.dayOfWeek, raw.sensor2);

With the escaped name, it works:

    df = df.withColumn("`raw.hourOfDay`", df.col("`raw.hourOfDay`"))
    df: org.apache.spark.sql.DataFrame = [raw.hourOfDay: bigint, raw.minOfDay: bigint, raw.dayOfWeek: bigint, raw.sensor2: bigint, `raw.hourOfDay`: bigint]

    scala> df.printSchema
    root
     |-- raw.hourOfDay: long (nullable = false)
     |-- raw.minOfDay: long (nullable = false)
     |-- raw.dayOfWeek: long (nullable = false)
     |-- raw.sensor2: long (nullable = false)
     |-- `raw.hourOfDay`: long (nullable = false)

But, as you can see, the schema now contains a new column whose name includes the backticks.

If I do the above and then try to drop the old column via the escaped name, the old column is indeed dropped, but after that any attempt to access the new column leads to something like:

 org.apache.spark.sql.AnalysisException: Cannot resolve column name "`raw.sensor2`" among (`raw.hourOfDay`, `raw.minOfDay`, `raw.dayOfWeek`, `raw.sensor2`); 

as if it now treats the backticks as part of the column name rather than as escape characters.
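To make the sequence concrete, this is roughly the code (a sketch only, with the column names above):

    // create the column under the escaped name (see the schema above)
    df = df.withColumn("`raw.sensor2`", df.col("`raw.sensor2`"))
    // drop the original column via the escaped name
    df = df.drop(df.col("`raw.sensor2`"))
    // any later access to the remaining column now fails:
    df.col("`raw.sensor2`")
    // org.apache.spark.sql.AnalysisException: Cannot resolve column name "`raw.sensor2`" among (...)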

So, how do I replace the old column with withColumn without changing its name?

(PS: note that my column names are parametric, so I loop over the names. I used concrete column names above for clarity; in the actual code the escaping really looks like "`" + colName + "`".)
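(For illustration only, that escaping could be factored into a small hypothetical helper like this:)

    // hypothetical helper: escape a column name only when it contains a dot
    def quoted(name: String): String =
      if (name.contains(".")) "`" + name + "`" else name

    df.col(quoted("raw.hourOfDay"))   // resolves fine, as shown above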

EDIT:

Right now the only trick I have found is to do:

    for (t <- df.columns) {
      if (t.contains(".")) {
        df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
        df = df.drop(df.col("`" + t + "`"))
        df = df.withColumnRenamed("`" + t + "`", t)
      } else {
        df = df.withColumn(t, myUdf(df.col(t)))
      }
    }

Not very pretty, I think...

EDIT :

The documentation states:

    def withColumn(colName: String, col: Column): DataFrame

    Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
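That is indeed what happens for a dot-free name; a quick sketch of the documented replace-in-place behaviour on a throwaway DataFrame (made-up column name, placeholder UDF):

    import org.apache.spark.sql.functions.udf

    val addOne = udf((x: Long) => x + 1)
    val simple = sqlContext.range(3).toDF("n")   // single column "n", no dots

    // same name => the existing column is replaced, no extra column appears
    val replaced = simple.withColumn("n", addOne(simple.col("n")))
    // replaced.columns is still Array("n")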

Therefore, replacing a column should not be a problem. However, as pointed out by @Glennie below, using a new name works fine, so this could be a bug in Spark 1.6.

+5
2 answers

Thanks for the trick.

    df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
    df = df.drop(df.col("`" + t + "`"))
    df = df.withColumnRenamed("`" + t + "`", t)

It works great for me. Looking forward to a better solution. Just a reminder that we will have similar problems with the "#" character.
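One possibly cleaner variant to experiment with (a sketch only, not tested on 1.6; myUdf as in the question): build a single select that applies the UDF under the escaped name and aliases the result back to the original dotted name, which avoids the drop/rename dance.

    import org.apache.spark.sql.functions.col

    // apply the UDF to every column under its escaped name,
    // then alias the result back to the original (dotted) name
    val transformed = df.select(df.columns.map { t =>
      myUdf(col("`" + t + "`")).alias(t)
    }: _*)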

+3

I do not believe that you can add a column with the same name as an existing column (and why would you?).

 df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`")) 

will indeed fail, but not because the name is incorrectly escaped; it fails because the name is identical to that of an existing column.

 df = df.withColumn("raw.hourOfDay_2", df.col("`raw.hourOfDay`")) 

on the other hand, will evaluate just fine :)

+1
