FileNotFoundException when trying to save a DataFrame to parquet with overwrite mode

I have run into a weird error. I have a program that reads a DataFrame if it exists (or creates it otherwise), modifies it, and then saves it back to the same target path in parquet format with overwrite mode.
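For reference, here is a minimal sketch of that read-modify-overwrite pattern; the path and columns are made up for illustration, but the shape is the same:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("overwrite-repro").getOrCreate()
    path = "/data/output"  # hypothetical target path

    try:
        df = spark.read.parquet(path)   # later runs: read the existing data
    except Exception:                   # first run: nothing to read yet
        df = spark.range(100).withColumn("value", F.lit(0))

    df = df.withColumn("value", F.col("value") + 1)  # the "modify" step

    # On the second run this line throws the FileNotFoundException
    df.write.parquet(path, mode="overwrite")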

In the first run, when there is no DataFrame yet, I create it and save it. This generates four files in the output folder:

  • _SUCCESS.crc
  • part-r-<......>.snappy.parquet.crc
  • _SUCCESS
  • part-r-<......>.snappy.parquet

Then in the second run, I read the data, modify it, and when I try to overwrite it, it throws an exception saying that *part-r-<.....>.snappy.parquet does not exist*.

The output folder is empty when the exception occurs, but the folder does contain this file right before df.write.parquet(path, 'overwrite') runs.

I tried setting spark.sql.parquet.cacheMetadata to false, but that didn't help. spark.catalog.listTables() returns an empty list, so there is no table metadata to refresh anyway.
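For completeness, that check looked roughly like this (spark.catalog is the public catalog API):

    spark.conf.set("spark.sql.parquet.cacheMetadata", "false")  # did not help
    print(spark.catalog.listTables())  # [] - no registered tables to refresh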

Currently I just delete the contents of the output folder before writing the DataFrame, and that works. But why does the built-in overwrite mode fail?

Thanks.

+4
2 answers

An RDD (and therefore a DataFrame) holds only the logic for computing its data (getPartitions), not the data itself, and it is evaluated lazily.

Your two runs look like this:

run 1 => ... => save to A
run 2 => read A => modify => save to A

Both end with a save to A. Because the Spark DAG is evaluated lazily, nothing is actually read until the final action (the save to A) triggers execution, and by that point overwrite mode has already deleted/truncated the files under A. Spark only materializes the data when an action runs, so the job ends up reading files it has just removed, which is exactly the FileNotFoundException you see.

Given this execution model, deleting the target yourself, as you already do, or writing to a tmp folder first and then moving it into place are the sensible workarounds.
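A rough sketch of the tmp-folder variant, assuming a PySpark job; the staging path is made up, and spark._jvm / spark._jsc are internal handles used here to reach the Hadoop FileSystem API:

    tmp_path = path + "_tmp"  # hypothetical staging location

    # 1. Materialize the result somewhere the plan does not read from.
    df.write.parquet(tmp_path, mode="overwrite")

    # 2. Swap it into place via the Hadoop FileSystem API.
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    fs.delete(jvm.org.apache.hadoop.fs.Path(path), True)  # recursive delete
    fs.rename(jvm.org.apache.hadoop.fs.Path(tmp_path),
              jvm.org.apache.hadoop.fs.Path(path))        # move tmp into place

The input is only deleted after the new output has been fully written, so there is no window in which the job can lose data.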

+2

As a workaround, you can force the DataFrame to be materialized before the write, e.g. with

df.cache()

followed by an action, and only then save it back to hdfs.
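A sketch of that idea; note that cache() is lazy by itself, so an action (count() here) is needed to actually pull the data into memory before the overwrite removes the source files:

    df = spark.read.parquet(path)
    df = df.withColumn("value", F.col("value") + 1)  # hypothetical modification

    df.cache()   # mark the DataFrame for caching (lazy on its own)
    df.count()   # action: materializes every partition into the cache

    df.write.parquet(path, mode="overwrite")  # no longer needs the source files

One caveat: if cached partitions are evicted or an executor is lost, Spark recomputes them from the original files, which are already gone at that point, so the tmp-folder approach above is the safer of the two.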

+1

Source: https://habr.com/ru/post/1671506/

