An unbounded table in Spark Structured Streaming

I'm starting to learn Spark, and I'm having a hard time understanding the rationale behind Structured Streaming. Structured Streaming treats all incoming data as an unbounded input table, where each new item in the data stream is handled as a new row appended to that table. I have the following code snippet that reads incoming files from csvFolder.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

// Schema for the incoming CSV files (all columns kept as strings)
val csvSchema = new StructType()
  .add("street", "string").add("city", "string")
  .add("zip", "string").add("state", "string").add("beds", "string")
  .add("baths", "string").add("sq__ft", "string").add("type", "string")
  .add("sale_date", "string").add("price", "string").add("latitude", "string")
  .add("longitude", "string")

// Each new file appearing in csvFolder becomes new rows of the unbounded input table
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")

val query = streamingDF.writeStream
  .format("console")
  .start()

query.awaitTermination()

What happens if I drop a 1 GB file into the folder? According to the documentation, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file at the next trigger, will it run out of memory trying to load it? Or does it load the file in batches automatically? If so, is this batching parameter configurable?
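For context, a small sketch (reusing streamingDF from the snippet above; the query name is just illustrative) of how the trigger interval can be set explicitly. If no trigger is set, Spark simply starts the next micro-batch as soon as the previous one finishes, rather than on a fixed millisecond schedule.

import org.apache.spark.sql.streaming.Trigger

// Start a micro-batch at most every 10 seconds instead of the default
// "as soon as the previous batch finishes" behaviour.
val throttledQuery = streamingDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

throttledQuery.awaitTermination()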

The key idea in Structured Streaming is to treat the incoming data stream as an unbounded input table: every item that arrives on the stream is like a new row being appended to that table.

[image: a data stream modeled as an unbounded input table]

You work with the stream through DataFrames/Datasets, and a query on the input DataFrame/Dataset incrementally produces a result table as new rows arrive.

[image: the Structured Streaming programming model]

As for the questions: will Spark run out of memory on a huge file, and can the batch size be configured?

Regarding OOM: the data behind the DataFrame/Dataset is an RDD split into partitions, so the 1 GB file is not loaded into memory as one blob; it is read and processed partition by partition across the executors.
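To make the batching knob concrete, here is a minimal sketch using the file source's maxFilesPerTrigger option, which caps how many new files are picked up per micro-batch. The schema, folder path, and console sink are taken from the question; the value "1" is just for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string") // ...add the remaining columns as in the question

// Pick up at most one new file per micro-batch; the file itself is still
// split into partitions and processed partition by partition, not as one blob.
val throttledDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", "1")
  .csv("./csvFolder/")

val query = throttledDF.writeStream.format("console").start()
query.awaitTermination()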


Source: https://habr.com/ru/post/1677623/

