Starting with Spark 2.0, CSV files can be read directly into a DataFrame.
If the data file does not have a header line, the call is simply:
val df = spark.read.csv("file://path/to/data.csv")
This will load the data, but give each column a generic name such as _c0 , _c1 , and so on.
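If you know the column order, one way to get meaningful names after a headerless read is to rename the columns with toDF. The sketch below builds the DataFrame from an in-memory sequence (an assumption for illustration, so it runs without the file); the same toDF call works on the result of spark.read.csv:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession setup and the sample rows are illustrative assumptions.
val spark = SparkSession.builder().master("local[1]").appName("rename-demo").getOrCreate()
import spark.implicits._

// Simulate a headerless read: a DataFrame with auto-generated column names.
val raw = Seq(("om", "scala", "120"), ("daniel", "spark", "80")).toDF()

// toDF with explicit names replaces the generated names in one call.
val named = raw.toDF("user", "topic", "hits")
```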
If there is a header, then adding .option("header", "true") will use the first line to define the column names in the DataFrame :
val df = spark.read
  .option("header", "true")
  .csv("file://path/to/data.csv")
For a specific example, suppose you have a file with the contents:
user,topic,hits
om,scala,120
daniel,spark,80
3754978,spark,1
Then you can get the total number of hits, grouped by topic:
import org.apache.spark.sql.functions._
import spark.implicits._

val rawData = spark.read
  .option("header", "true")
  .csv("file://path/to/data.csv")

// CSV columns are read as strings, so cast hits before summing
val hitsByTopic = rawData
  .groupBy("topic")
  .agg(sum($"hits".cast("int")).as("total_hits"))
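If you would rather avoid the cast, .option("inferSchema", "true") asks Spark to scan the file and choose column types, so hits comes back as an integer. The sketch below writes the sample data to a temporary file so it is self-contained; the temp-file setup is an assumption for illustration:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// SparkSession setup is an illustrative assumption.
val spark = SparkSession.builder().master("local[1]").appName("infer-demo").getOrCreate()

// Write the sample data from above to a temp file so the example runs as-is.
val tmp = Files.createTempFile("data", ".csv")
Files.write(tmp, "user,topic,hits\nom,scala,120\ndaniel,spark,80\n3754978,spark,1\n".getBytes)

// inferSchema makes Spark scan the file and pick column types,
// so hits is typed as an integer instead of a string.
val typed = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(tmp.toString)
```

Note that schema inference costs an extra pass over the data, which can matter for large files.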
hayden.sikh Jun 15 '17 at 0:40