I am new to Spark and I cannot find an answer to this. I have many parquet files uploaded to s3 at:
s3://sm/data/scor/alpha/2016/parquet/
The total size of this folder is 20+ GB. How do I load all these files into a dataframe?
The allocated memory for the Spark cluster is 6 GB.
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Reuse the existing context and pass the S3 credentials to Hadoop
sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')

# Read every parquet file under the folder into one dataframe
sqlContext = SQLContext(sc)
df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")
Error:
Py4JJavaError: An error occurred while calling o33.parquet.
: java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
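From what I have read, "No FileSystem for scheme: s3" seems to mean the S3 connector is not on the classpath, and that I should use the s3a:// scheme (backed by the hadoop-aws package) instead of s3://. Is something like the untested sketch below the right direction? The hadoop-aws version (2.7.3) is my guess based on the Hadoop build bundled with Spark 2.x, the access/secret key values are placeholders, and I understand spark.jars.packages only takes effect if it is set before the context is created (so not after my SparkContext.getOrCreate() above).

from pyspark.sql import SparkSession

# Untested sketch: the hadoop-aws version must match the Hadoop build
# bundled with Spark, and the key values here are placeholders.
spark = (
    SparkSession.builder
    .appName("read-s3-parquet")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.access.key", "A")
    .config("spark.hadoop.fs.s3a.secret.key", "s")
    .getOrCreate()
)

# Note s3a:// rather than s3://; the read is lazy, so creating the
# dataframe should not pull the 20+ GB into the 6 GB of cluster memory.
df = spark.read.parquet("s3a://sm/data/scor/alpha/2016/parquet/")
df.printSchema()

And if that is right, does it also mean the 6 GB memory limit is not a problem for reading, since Spark scans the files partition by partition and only materializes what an action actually needs?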