I have a 10 GB CSV file on a Hadoop cluster that contains duplicate column names. I am trying to parse it in SparkR, so I use the spark-csv package to load it as a DataFrame:
df <- read.df(sqlContext, FILE_PATH, source = "com.databricks.spark.csv", header = "true", mode = "DROPMALFORMED")
But since df has duplicate Email columns, selecting that column by name raises an error:
select(df, 'Email')

15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous, could be: Email
I want to keep the first occurrence of the Email column and drop the later one. How can I do that?
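One workaround I have considered (a sketch, not tested at scale): read the file with header = "false" so spark-csv assigns unique positional names (C0, C1, ...) instead of the duplicated header names, filter out the header row, and then select the wanted Email column by position. The column index and the sentinel value "Email" below are assumptions about my file's layout.

# Sketch: avoid the ambiguous name entirely by reading without the header.
# spark-csv then names the columns C0, C1, ..., which are unique.
df <- read.df(sqlContext, FILE_PATH,
              source = "com.databricks.spark.csv",
              header = "false", mode = "DROPMALFORMED")

# The original header line is now an ordinary data row; drop it.
# (Assumes C0 holds the literal header value "Email" is NOT in C0's data.)
df <- filter(df, df$C0 != "Email")

# Suppose the first Email column is the 3rd column in the file:
# select it by its unique positional name instead of the ambiguous "Email".
emails <- select(df, "C2")

This sidesteps the ambiguity rather than renaming the duplicate, so it only helps if I know the column positions; I would still welcome a way to rename or drop the second Email column directly.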