How to read parquet data from S3 to fix data in Python?

I am new to Spark and I cannot find this ... I have many parquet files uploaded s3to at:

s3://a-dps/d-l/sco/alpha/20160930/parquet/

The total size of this folder 20+ Gb. How to write and read this in a data frame How to load all these files into a dataframe?

The allocated memory for the spark cluster is 6 GB.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    import pandas
    # SparkConf().set("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
    sc = SparkContext.getOrCreate()

    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')

    sqlContext = SQLContext(sc)
    df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")

Error:

    Py4JJavaError: An error occurred while calling o33.parquet.
    : java.io.IOException: No FileSystem for scheme: s3
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass (FileSystem.java:2660)
        at org.apache.hadoop.fs.FileSystem.createFileSystem (FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access $ 200 (FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)

 
+7
2

(s3), , . s3n s3a ( s3):

// use sqlContext instead for spark <2 
val df = spark.read 
              .load("s3n://bucket-name/object-path")

Hadoop-AWS: - Amazon.

+9

SparkSession sqlContext, Spark 2.0

spark = SparkSession.builder
                        .master("local")             
                        .appName("app name")             
                        .config("spark.some.config.option", true).getOrCreate()

df = spark.read.parquet("s3://path/to/parquet/file.parquet")
+5

Source: https://habr.com/ru/post/1679625/


All Articles