How to read show statement output back to dataset?

Assuming we have the following text file (output from the df.show() command):

    +----+---------+--------+
    |col1|     col2|    col3|
    +----+---------+--------+
    |   1|pi number|3.141592|
    |   2| e number| 2.71828|
    +----+---------+--------+

Now I want to read / parse it back as a DataFrame / Dataset. What is the most "sparkling" way to do this?

P.S. I'm interested in solutions for both Scala and PySpark, so both tags are used.

1 answer

UPDATE: with the help of the univocity parser lib, I could get rid of the extra step where I removed the spaces from the column names:

Scala:

    // read Spark show() output as a fixed-width table:
    def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
      val t = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("delimiter", "|")
        .option("parserLib", "UNIVOCITY")
        .option("ignoreLeadingWhiteSpace", "true")
        .option("ignoreTrailingWhiteSpace", "true")
        .option("comment", "+")
        .csv(filePath)

      // drop the empty columns produced by the leading and trailing "|" delimiters
      t.select(t.columns.filterNot(_.startsWith("_c")).map(t(_)): _*)
    }

PySpark:

    def read_spark_output(file_path):
        t = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .option("delimiter", "|") \
            .option("parserLib", "UNIVOCITY") \
            .option("ignoreLeadingWhiteSpace", "true") \
            .option("ignoreTrailingWhiteSpace", "true") \
            .option("comment", "+") \
            .csv(file_path)  # use the parameter instead of a hard-coded path
        # drop the empty columns produced by the leading and trailing "|" delimiters
        return t.select([c for c in t.columns if not c.startswith("_")])

Usage example:

    scala> val df = readSparkOutput("file:///tmp/spark.out")
    df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

    scala> df.show
    +----+---------+--------+
    |col1|     col2|    col3|
    +----+---------+--------+
    |   1|pi number|3.141592|
    |   2| e number| 2.71828|
    +----+---------+--------+

    scala> df.printSchema
    root
     |-- col1: integer (nullable = true)
     |-- col2: string (nullable = true)
     |-- col3: double (nullable = true)
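For completeness, a minimal PySpark usage sketch, assuming an active SparkSession named spark and the same /tmp/spark.out file as above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = read_spark_output("file:///tmp/spark.out")
    df.show()
    df.printSchema()  # expect: col1 int, col2 string, col3 double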

Old answer:

Here is my attempt in Scala (Spark 2.2):

    // read Spark show() output as a fixed-width table:
    val t = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", "|")
      .option("comment", "+")
      .csv("file:///tmp/spark.out")

    // drop the empty columns produced by the leading and trailing "|" delimiters
    val cols = t.columns.filterNot(c => c.startsWith("_c")).map(a => t(a))

    // trim spaces from the column names
    val colsTrimmed = t.columns.filterNot(c => c.startsWith("_c")).map(c => c.replaceAll("\\s+", ""))

    // rename columns using 'colsTrimmed'
    val df = t.select(cols: _*).toDF(colsTrimmed: _*)

It works, but I feel there should be a much more elegant way to do it. Also note that without the whitespace-trimming options, the space-padded col1 values are inferred as double rather than int:

    scala> df.show
    +----+---------+--------+
    |col1|     col2|    col3|
    +----+---------+--------+
    | 1.0|pi number|3.141592|
    | 2.0| e number| 2.71828|
    +----+---------+--------+

    scala> df.printSchema
    root
     |-- col1: double (nullable = true)
     |-- col2: string (nullable = true)
     |-- col3: double (nullable = true)

Source: https://habr.com/ru/post/1272801/

