Duplicate Columns in a Spark Dataframe

I have a 10 GB CSV file on a Hadoop cluster that contains duplicate columns. I am trying to parse it in SparkR, using the spark-csv package to load it as a DataFrame:

    df <- read.df(sqlContext, FILE_PATH,
                  source = "com.databricks.spark.csv",
                  header = "true", mode = "DROPMALFORMED")

But since df has two Email columns, trying to select that column raises an error:

    select(df, 'Email')
    15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
    Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
      org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous,
      could be: Email#350, Email#361.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
        ...

I want to keep the first occurrence of the Email column and delete the later one. How can I do that?

+5
2 answers

A better way would be to change the column name upstream;)

If that is not possible, there are a couple of options:

  • If the columns differ only in case (for example "Email" and "email"), you can enable case sensitivity:

      sql(sqlContext, "set spark.sql.caseSensitive=true") 
  • If the column names are identical, you have to specify the schema manually and skip the header row:

      customSchema <- structType(
        structField("year", "integer"),
        structField("make", "string"),
        structField("model", "string"),
        structField("comment", "string"),
        structField("blank", "string"))

      df <- read.df(sqlContext, "cars.csv",
                    source = "com.databricks.spark.csv",
                    header = "true", schema = customSchema)
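The "fix it upstream" suggestion can also be done with a small preprocessing pass before the file reaches Spark. As a rough sketch in plain Python (not SparkR; the in-memory streams here stand in for the real HDFS file, and the `.1`, `.2` naming scheme is just one possible convention), rewrite the header line so every column name is unique:

```python
import csv
import io

def dedupe_header(names):
    """Append .1, .2, ... to repeated column names so each is unique."""
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append(f"{name}.{seen[name]}")
        else:
            seen[name] = 0
            out.append(name)
    return out

def rewrite_csv_header(src, dst):
    """Copy a CSV stream, replacing its header row with a deduplicated one."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(dedupe_header(next(reader)))
    for row in reader:
        writer.writerow(row)

# Toy example; for a 10 GB file you would stream between file handles.
src = io.StringIO("Email,Name,Email\na@x.com,Ann,b@x.com\n")
dst = io.StringIO()
rewrite_csv_header(src, dst)
print(dst.getvalue().splitlines()[0])  # Email,Name,Email.1
```

After this pass, `select(df, 'Email')` in SparkR is no longer ambiguous, since the second occurrence is now named differently.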
+4

Try renaming the duplicate column. You can refer to it by position rather than by name:

    colnames(df)[x] <- 'deleteme'

Alternatively, you can drop the column directly:

    newdf <- df[, -x]

where x is the number of the column you do not want.

Update:

If the above does not work, you can set header to false and then use the first row to rename the columns:

    df <- read.df(sqlContext, FILE_PATH,
                  source = "com.databricks.spark.csv",
                  header = "false", mode = "DROPMALFORMED")

    # get the first row to use as column names
    mycolnames <- df[1, ]

    # edit the duplicated name in place
    mycolnames[x] <- 'IamNotADup'
    colnames(df) <- mycolnames

    # drop the first row, which is now redundant header data:
    df <- df[-1, ]
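The questioner's exact goal, keeping the first occurrence of each duplicated column and dropping the later ones, can likewise be sketched as an upstream preprocessing step. This is plain Python operating on a toy stand-in for the real file, not SparkR:

```python
def keep_first_occurrences(rows):
    """Given CSV rows (header first), drop every column whose name
    already appeared earlier in the header, keeping the first copy."""
    rows = iter(rows)
    header = next(rows)
    keep = []           # column indices to retain
    seen = set()
    for i, name in enumerate(header):
        if name not in seen:
            seen.add(name)
            keep.append(i)
    yield [header[i] for i in keep]
    for row in rows:
        yield [row[i] for i in keep]

data = [["Email", "Name", "Email"],
        ["a@x.com", "Ann", "b@x.com"]]
result = list(keep_first_occurrences(data))
print(result)  # [['Email', 'Name'], ['a@x.com', 'Ann']]
```

Because the surviving column names are unique, the cleaned file loads into SparkR without triggering the ambiguous-reference error.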
+1

Source: https://habr.com/ru/post/1236381/
