How to read show statement output back to dataset?

Assuming we have the following text file (output from the df.show() command):

    +----+---------+--------+
    |col1|     col2|    col3|
    +----+---------+--------+
    |   1|pi number|3.141592|
    |   2| e number| 2.71828|
    +----+---------+--------+

Now I want to read / parse it back as a DataFrame / Dataset. What is the most "sparkling" way to do this?

P.S. I'm interested in solutions for both Scala and PySpark, so both tags are used.

1 answer

UPDATE: with the help of the univocity parser lib, I could get rid of the extra step where I removed the spaces from the column names:

Scala:

    // read Spark show() output as a fixed-width table:
    def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
      val t = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("delimiter", "|")
        .option("parserLib", "UNIVOCITY")
        .option("ignoreLeadingWhiteSpace", "true")
        .option("ignoreTrailingWhiteSpace", "true")
        .option("comment", "+")
        .csv(filePath)

      // drop the empty columns produced by the leading and trailing "|" delimiters
      t.select(t.columns.filterNot(_.startsWith("_c")).map(t(_)): _*)
    }

PySpark:

    def read_spark_output(file_path):
        t = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .option("delimiter", "|") \
            .option("parserLib", "UNIVOCITY") \
            .option("ignoreLeadingWhiteSpace", "true") \
            .option("ignoreTrailingWhiteSpace", "true") \
            .option("comment", "+") \
            .csv(file_path)  # use the parameter instead of a hard-coded path
        # drop the empty columns produced by the leading and trailing "|" delimiters
        return t.select([c for c in t.columns if not c.startswith("_")])

Usage example:

    scala> val df = readSparkOutput("file:///tmp/spark.out")
    df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

    scala> df.show
    +----+---------+--------+
    |col1|     col2|    col3|
    +----+---------+--------+
    |   1|pi number|3.141592|
    |   2| e number| 2.71828|
    +----+---------+--------+

    scala> df.printSchema
    root
     |-- col1: integer (nullable = true)
     |-- col2: string (nullable = true)
     |-- col3: double (nullable = true)
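For completeness, a minimal PySpark usage sketch, assuming an active SparkSession named spark and the same /tmp/spark.out file as above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = read_spark_output("file:///tmp/spark.out")
    df.show()
    df.printSchema()  # expect: col1 int, col2 string, col3 double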

Old answer:

Here is my attempt in Scala (Spark 2.2):

    // read Spark show() output as a fixed-width table:
    val t = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", "|")
      .option("comment", "+")
      .csv("file:///tmp/spark.out")

    // drop the empty columns produced by the leading and trailing "|" delimiters
    val cols = t.columns.filterNot(c => c.startsWith("_c")).map(a => t(a))

    // trim spaces from the column names
    val colsTrimmed = t.columns.filterNot(c => c.startsWith("_c")).map(c => c.replaceAll("\\s+", ""))

    // rename columns using 'colsTrimmed'
    val df = t.select(cols: _*).toDF(colsTrimmed: _*)

It works, but I feel there should be a much more elegant way to do it. Also note that without the whitespace-trimming options, the space-padded col1 values are inferred as double rather than int:

    scala> df.show
    +----+---------+--------+
    |col1|     col2|    col3|
    +----+---------+--------+
    | 1.0|pi number|3.141592|
    | 2.0| e number| 2.71828|
    +----+---------+--------+

    scala> df.printSchema
    root
     |-- col1: double (nullable = true)
     |-- col2: string (nullable = true)
     |-- col3: double (nullable = true)

Source: https://habr.com/ru/post/1272801/

