Saving text in Spark 1.3.0 using DataFrames in Scala

I am using Spark 1.3.0 with DataFrames and Spark SQL in Scala. In version 1.2.0 there was a "saveAsText" method. In 1.3.0, when working with DataFrames, there is only a "save" method, and its default output format is Parquet.

How can I specify that the output should be TEXT when using the save method?

    // sc is an existing SparkContext.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // This is used to implicitly convert an RDD to a DataFrame.
    import sqlContext.implicits._

    // Define the schema using a case class.
    // Note: case classes in Scala 2.10 support only up to 22 fields. To work around
    // this limit, you can use custom classes that implement the Product interface.
    case class Person(name: String, age: Int)

    // Create an RDD of Person objects and register it as a table.
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()
    people.registerTempTable("people")

    // SQL statements can be run using the sql method provided by sqlContext.
    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.save("/user/me/out")
4 answers

You can use this:

 teenagers.rdd.saveAsTextFile("/user/me/out") 
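Note what this actually writes: saveAsTextFile serializes each Row via its toString, which in Spark 1.x renders the fields in brackets. A minimal sketch of that formatting, using plain Seqs as hypothetical stand-ins for the Rows in teenagers.rdd:

```scala
// Hypothetical rows, standing in for what teenagers.rdd would yield.
val rows = Seq(Seq[Any]("Justin", 19), Seq[Any]("Michael", 15))

// Row.toString in Spark 1.x is (roughly) the fields joined by commas
// inside square brackets, so the saved lines look like "[Justin,19]".
val lines = rows.map(r => r.mkString("[", ",", "]"))
```

If you want a different layout (e.g. tab- or semicolon-separated), map each Row to your own string before calling saveAsTextFile.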

First, consider whether you really need to save the DataFrame as text. A DataFrame stores data column-wise (rather than row-wise like an RDD), so the .rdd operation is expensive because the data must be reorganized. Parquet is a columnar format and much more efficient to use.

That said, sometimes you really do need to save to a text file.

As far as I know, a DataFrame will not let you save as a text file out of the box. If you look at the source code, you will see that four formats are supported:

    jdbc
    json
    parquet
    orc

So your options are either to use df.rdd.saveAsTextFile, as suggested earlier, or to use spark-csv, which allows you to do something like:

Spark 1.4+:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("cars.csv")
    df.select("year", "model")
      .write
      .format("com.databricks.spark.csv")
      .save("newcars.csv")

Spark 1.3:

    val df = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "cars.csv", "header" -> "true"))
    df.select("year", "model")
      .save("newcars.csv", "com.databricks.spark.csv")

with the added benefit of handling the annoying parts of quoting and escaping strings for you.
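To see why that matters, here is a sketch (not the spark-csv implementation, just an illustration of the standard CSV rules) of the quoting and escaping you would otherwise have to do by hand:

```scala
// A field containing the delimiter, a quote, or a newline must be wrapped
// in double quotes, and any embedded double quotes must be doubled.
def csvField(s: String): String =
  if (s.exists(c => c == ',' || c == '"' || c == '\n'))
    "\"" + s.replace("\"", "\"\"") + "\""
  else s

// Join the escaped fields into one CSV line.
def csvLine(fields: Seq[String]): String =
  fields.map(csvField).mkString(",")
```

A naive mkString(",") would corrupt any field that itself contains a comma; a library like spark-csv applies rules like these consistently on both write and read.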


If you look at the migration guide https://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13, you will see that

[...] DataFrames are no longer inherited from RDD directly [...]

You can still use saveAsTextFile if you first call the .rdd method to get an RDD[Row].


In Python, to get CSV output (without a header) from a DataFrame df:

 df.rdd.map(lambda r: ";".join([str(c) for c in r])).saveAsTextFile(outfilepath) 
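The Scala equivalent of that one-liner would be df.rdd.map(_.mkString(";")).saveAsTextFile(outFilePath). The per-row formatting it relies on, sketched here on a plain Seq standing in for a Spark Row (whose mkString behaves the same way):

```scala
// Hypothetical row contents; a real Row would come from df.rdd.
val row = Seq[Any]("Justin", 19)

// Convert every field to a string and join with the chosen separator,
// mirroring the ";".join(...) in the Python answer above.
val line = row.map(_.toString).mkString(";")
```

Note that, like the Python version, this does no CSV quoting, so it is only safe when no field can contain the separator.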

There is also an extension developed by Databricks, spark-csv. See https://github.com/databricks/spark-csv


Source: https://habr.com/ru/post/984318/
