Duplicate Columns in a Spark Dataframe

I have a 10 GB CSV file on a Hadoop cluster that contains duplicate columns. I am trying to parse it in SparkR, using the spark-csv package to load it as a DataFrame:

    df <- read.df(sqlContext, FILE_PATH,
                  source = "com.databricks.spark.csv",
                  header = "true", mode = "DROPMALFORMED")

But since df has two Email columns, trying to select that column raises an error:

    select(df, 'Email')
    15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
    Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
      org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous,
      could be: Email#350, Email#361.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
        ...

I want to keep the first occurrence of the Email column and delete the later one. How can I do that?

+5
2 answers

A better way would be to change the column name upstream;)

If that is not possible, there are a couple of options:

  • If the columns differ only in case (for example "Email" and "email"), you can enable case sensitivity:

      sql(sqlContext, "set spark.sql.caseSensitive=true") 
  • If the column names are identical, you have to specify the schema manually and skip the header row:

      customSchema <- structType(
        structField("year", "integer"),
        structField("make", "string"),
        structField("model", "string"),
        structField("comment", "string"),
        structField("blank", "string"))

      df <- read.df(sqlContext, "cars.csv",
                    source = "com.databricks.spark.csv",
                    header = "true", schema = customSchema)
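The "fix it upstream" suggestion can also be done with a small preprocessing pass before the file reaches Spark. As a rough sketch in plain Python (not SparkR; the in-memory streams here stand in for the real HDFS file, and the `.1`, `.2` naming scheme is just one possible convention), rewrite the header line so every column name is unique:

```python
import csv
import io

def dedupe_header(names):
    """Append .1, .2, ... to repeated column names so each is unique."""
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append(f"{name}.{seen[name]}")
        else:
            seen[name] = 0
            out.append(name)
    return out

def rewrite_csv_header(src, dst):
    """Copy a CSV stream, replacing its header row with a deduplicated one."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(dedupe_header(next(reader)))
    for row in reader:
        writer.writerow(row)

# Toy example; for a 10 GB file you would stream between file handles.
src = io.StringIO("Email,Name,Email\na@x.com,Ann,b@x.com\n")
dst = io.StringIO()
rewrite_csv_header(src, dst)
print(dst.getvalue().splitlines()[0])  # Email,Name,Email.1
```

After this pass, `select(df, 'Email')` in SparkR is no longer ambiguous, since the second occurrence is now named differently.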
+4

Try renaming the duplicate column. You can refer to it by position rather than by name:

    colnames(df)[x] <- 'deleteme'

Alternatively, you can drop the column directly:

    newdf <- df[, -x]

where x is the number of the column you do not want.

Update:

If the above does not work, you can set header to false and then use the first row to rename the columns:

    df <- read.df(sqlContext, FILE_PATH,
                  source = "com.databricks.spark.csv",
                  header = "false", mode = "DROPMALFORMED")

    # get the first row to use as column names
    mycolnames <- df[1, ]

    # edit the duplicated name in place
    mycolnames[x] <- 'IamNotADup'
    colnames(df) <- mycolnames

    # drop the first row, which is now redundant header data:
    df <- df[-1, ]
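The questioner's exact goal, keeping the first occurrence of each duplicated column and dropping the later ones, can likewise be sketched as an upstream preprocessing step. This is plain Python operating on a toy stand-in for the real file, not SparkR:

```python
def keep_first_occurrences(rows):
    """Given CSV rows (header first), drop every column whose name
    already appeared earlier in the header, keeping the first copy."""
    rows = iter(rows)
    header = next(rows)
    keep = []           # column indices to retain
    seen = set()
    for i, name in enumerate(header):
        if name not in seen:
            seen.add(name)
            keep.append(i)
    yield [header[i] for i in keep]
    for row in rows:
        yield [row[i] for i in keep]

data = [["Email", "Name", "Email"],
        ["a@x.com", "Ann", "b@x.com"]]
result = list(keep_first_occurrences(data))
print(result)  # [['Email', 'Name'], ['a@x.com', 'Ann']]
```

Because the surviving column names are unique, the cleaned file loads into SparkR without triggering the ambiguous-reference error.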
+1

Source: https://habr.com/ru/post/1236381/
