Parse CSV and load as DataFrame / DataSet with Spark 2.x
First initialize the SparkSession object; in the shells it is already available by default as spark
val spark = org.apache.spark.sql.SparkSession.builder
  .master("local")
  .appName("Spark CSV Reader")
  .getOrCreate()
Use any of the following methods to load the CSV as a DataFrame/DataSet.
1. Do it programmatically
val df = spark.read
  .format("csv")
  .option("header", "true")        // first line in file has headers
  .option("mode", "DROPMALFORMED") // drop malformed rows instead of failing
  .load("hdfs:///csv/file/dir/file.csv")
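To get a typed DataSet rather than a DataFrame, map the rows onto a case class. A minimal sketch, assuming a hypothetical two-column schema (name, age); adjust the fields to your CSV:

case class Person(name: String, age: Int) // hypothetical schema, adjust to your columns

import spark.implicits._ // provides the encoders needed for .as[Person]
val ds = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true") // infer Int/String column types so they line up with the case class
  .load("hdfs:///csv/file/dir/file.csv")
  .as[Person]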
2. You can also do it the SQL way
val df = spark.sql("SELECT * FROM csv.`hdfs:///csv/file/dir/file.csv`")
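The backtick syntax above queries the file directly by path. If you prefer to run SQL against a named table, you can register the DataFrame as a temporary view first; a quick sketch (the view name is arbitrary):

df.createOrReplaceTempView("csv_table") // view name is arbitrary
val result = spark.sql("SELECT * FROM csv_table")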
Dependencies:
"org.apache.spark" % "spark-core_2.11" % 2.0.0, "org.apache.spark" % "spark-sql_2.11" % 2.0.0,
Spark version < 2.0
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("csv/file/path")
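Note that before 2.0 there is no SparkSession; outside the shells (which give you sc and sqlContext for free) the sqlContext has to be built from a SparkContext. A minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setMaster("local").setAppName("Spark CSV Reader")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc) // entry point for DataFrame reads in Spark 1.x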
Dependencies:
"org.apache.spark" % "spark-sql_2.10" % 1.6.0, "com.databricks" % "spark-csv_2.10" % 1.6.0, "com.univocity" % "univocity-parsers" % LATEST,
mrsrinivas Sep 16 '16 at 14:01