Full disclosure: I work for Databricks, but I do not represent them on Stack Overflow.
Using Spark 2.0, how can I read these nested subfolders and create a static DataFrame from all of the leaf JSON files? Is there an "option" for the DataFrame reader?
DataFrameReader supports loading a sequence of paths. See the documentation for def json(paths: String*): DataFrame. You can specify the sequence explicitly, use a glob pattern, or build the list programmatically (recommended):
val inputPathSeq = Seq[String](
  "/mnt/myles/structured-streaming/2016/12/18/02",
  "/mnt/myles/structured-streaming/2016/12/18/03")
val inputPathGlob = "/mnt/myles/structured-streaming/2016/12/18/*"
val basePath = "/mnt/myles/structured-streaming/2016/12/18/0"
val inputPathList = (2 to 4).toList.map(basePath + _ + "/*.json")
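For reference, here is a quick sketch of how each form is handed to the reader (the variable names dfFromGlob and dfFromList are just illustrative): a glob is a single String, while a programmatically built List[String] is expanded into the varargs parameter with : _*.

// Illustrative only: a glob is passed as a single path string
val dfFromGlob = spark.read.json(inputPathGlob)

// Illustrative only: a List[String] is expanded into the varargs parameter with `: _*`
val dfFromList = spark.read.json(inputPathList: _*)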
I know this is all experimental, hoping someone has used S3 as a file source for a stream before, where the data is partitioned into folders as described above. Of course we would prefer a direct Kinesis stream, but there is no ETA for that connector, so Firehose -> S3 is the interim solution.
Since you are using DBFS, I am going to assume that the S3 buckets Firehose delivers data into are already mounted in DBFS. Check out the Databricks documentation if you need help mounting your S3 bucket in DBFS. Once you have specified your input path as described above, you can simply load the files into a static or streaming DataFrame:
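If the bucket is not mounted yet, the sketch below follows the mount pattern from the Databricks docs; the access key, secret key, bucket name, and mount point are all placeholders, and dbutils is only available inside a Databricks notebook.

// Placeholder credentials and names; substitute your own.
val accessKey = "YOUR_AWS_ACCESS_KEY"
val secretKey = java.net.URLEncoder.encode("YOUR_AWS_SECRET_KEY", "UTF-8") // the secret must be URL-encoded
val bucketName = "your-firehose-bucket"

// Mount the bucket under /mnt so Spark can read it through DBFS paths like /mnt/myles/...
dbutils.fs.mount(
  source = s"s3a://$accessKey:$secretKey@$bucketName",
  mountPoint = "/mnt/myles")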
Static
val staticInputDF = spark
  .read
  .schema(jsonSchema)
  .json(inputPathSeq: _*)

staticInputDF.isStreaming
res: Boolean = false
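Both the static and the streaming example assume a jsonSchema value is already defined. As a minimal sketch, assuming a record with a timestamp and a couple of string fields (the field names here are illustrative, not from the question), it could be built like this:

import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

// Illustrative schema only; replace the fields with the ones your Firehose records actually contain.
val jsonSchema = new StructType()
  .add("time", TimestampType)
  .add("action", StringType)
  .add("deviceId", StringType)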
Streaming
val streamingInputDF = spark
  .readStream                       // `readStream` instead of `read` for creating a streaming DataFrame
  .schema(jsonSchema)               // Set the schema of the JSON data
  .option("maxFilesPerTrigger", 1)  // Treat a sequence of files as a stream by picking one file at a time
  .json(inputPathGlob)              // DataStreamReader.json takes a single path, so use the glob here

streamingInputDF.isStreaming
res: Boolean = true
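Note that a streaming DataFrame does nothing until you attach a sink and start the query. As a minimal sketch, assuming small data and an illustrative query name (firehose_events), you could write to the in-memory sink for interactive inspection, a pattern similar to the one in the Databricks example:

// Start the stream into the in-memory sink for interactive inspection (small data only).
// The query name "firehose_events" is illustrative.
val query = streamingInputDF
  .writeStream
  .format("memory")
  .queryName("firehose_events")
  .outputMode("append")
  .start()

// Once a few files have been processed, the results are queryable as a temp table:
spark.sql("SELECT COUNT(*) FROM firehose_events").show()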
Most of this comes directly from the Databricks documentation on Structured Streaming. There is even an example notebook that you can import directly into Databricks.