Spark 2.0.x: dumping a CSV file from a DataFrame with a column of type Array[String]

I have a DataFrame df that contains one column of array type.

df.show() looks like this:

    +--+-------------+---+------+
    |ID|ArrayOfString|Age|Gender|
    +--+-------------+---+------+
    |1 | [A,B,D]     |22 | F    |
    |2 | [A,Y]       |42 | M    |
    |3 | [X]         |60 | F    |
    +--+-------------+---+------+

I am trying to dump this df to a CSV file as follows:

    val dumpCSV = df.write.csv(path = "/home/me/saveDF")

It does not work because of the ArrayOfString column. I get the error message:

CSV data source does not support array&lt;string&gt; data type

The code works if I drop the ArrayOfString column, but I need to keep ArrayOfString!

What is the best way to dump this df to a CSV file, including the ArrayOfString column (ArrayOfString should be written as a single column in the CSV file)?
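
For reference, a minimal sketch that reproduces the problem (assuming a SparkSession named spark and the sample data above):

    import spark.implicits._

    val df = Seq(
      (1, Seq("A", "B", "D"), 22, "F"),
      (2, Seq("A", "Y"), 42, "M"),
      (3, Seq("X"), 60, "F")
    ).toDF("ID", "ArrayOfString", "Age", "Gender")

    // Fails with: CSV data source does not support array<string> data type.
    df.write.csv("/home/me/saveDF")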

+25
5 answers

The reason you get this error is that the CSV file format does not support array types; you need to express the array as a string in order to save it.

Try the following:

    import org.apache.spark.sql.functions._

    val stringify = udf((vs: Seq[String]) => vs match {
      case null => null
      case _    => s"""[${vs.mkString(",")}]"""
    })

    df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)

or

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions._

    def stringify(c: Column): Column = concat(lit("["), concat_ws(",", c), lit("]"))

    df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
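
Either variant writes the array as a single [A,B,D]-style string. If you later need the array back after reading the CSV, here is a rough sketch (my assumptions: the bracket format above, elements that contain no commas, and a column name of _c1, since CSV does not preserve the schema):

    import org.apache.spark.sql.functions._

    val readBack = spark.read.csv("/home/me/saveDF")

    // Strip the surrounding brackets, then split on commas to rebuild the array
    val restored = readBack.withColumn(
      "ArrayOfString",
      split(regexp_replace(readBack("_c1"), "^\\[|\\]$", ""), ",")
    )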
+23

PySpark implementation:

In this example, the column_as_array column is converted to a new string column column_as_str before saving.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def array_to_string(my_list):
        return '[' + ','.join([str(elem) for elem in my_list]) + ']'

    array_to_string_udf = udf(array_to_string, StringType())

    df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))

Then you can drop the old column (the array type) before saving:

    df.drop("column_as_array").write.csv(...)
+12

There is no need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:

    import org.apache.spark.sql.functions.col

    val dumpCSV = df
      .withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
      .write
      .csv(path = "/home/me/saveDF")
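
Note that the exact string the cast produces depends on the Spark version (recent versions render the array as [A, B, D]). A quick way to inspect the result before writing, using the column name from the question:

    import org.apache.spark.sql.functions.col

    df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
      .select("ArrayOfString")
      .show(truncate = false)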

Hope this helps.

+6

Here is a method to convert all ArrayType columns (of any base type) of a DataFrame to StringType:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    def stringifyArrays(dataFrame: DataFrame): DataFrame = {
      // Collect the names of all columns whose type is ArrayType
      val colsToStringify = dataFrame.schema
        .filter(p => p.dataType.typeName == "array")
        .map(p => p.name)

      // Rewrite each array column as a "[a, b, c]"-style string
      colsToStringify.foldLeft(dataFrame) { (df, c) =>
        df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
      }
    }

As a bonus, it does not use a UDF.
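
A hypothetical usage, combining the method above with the write from the question:

    // Stringify every array column, then dump to CSV
    stringifyArrays(df).write.csv("/home/me/saveDF")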

+5

CSV is not an ideal export format, but if you just want to visually inspect your data, this will work [Scala]. A quick and dirty solution:

    // toDF on an RDD needs a SparkSession's implicits in scope
    import spark.implicits._

    case class Example(id: String, ArrayOfString: String, Age: String, Gender: String)

    // Render every field with toString; the array column ends up as
    // its Scala representation, e.g. "WrappedArray(A, B, D)"
    df.rdd.map { line =>
      Example(line(0).toString, line(1).toString, line(2).toString, line(3).toString)
    }.toDF.write.csv("/tmp/example.csv")
0
