I have a strange problem. I have a program that reads a data frame if it exists (or creates it otherwise), modifies it, and then saves it back to the same target path in Parquet format with overwrite mode.
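Roughly, the logic looks like this (a minimal sketch, not my exact code; the path, the dummy creation step, and the modification are placeholders for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
path = "/data/output"  # hypothetical target path

try:
    df = spark.read.parquet(path)  # read the data frame if it already exists
except Exception:
    # on the first run the path does not exist yet, so create dummy data
    df = spark.range(10).toDF("value")

df = df.withColumn("value", F.col("value") + 1)  # some modification
df.write.parquet(path, 'overwrite')  # save back to the same path
```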
In the first run, when there is no data frame, I create it and save it. It generates 4 files in the output folder:
- _SUCCESS.crc
- part-r-<...>.snappy.parquet.crc
- _SUCCESS
- part-r-<...>.snappy.parquet
Then in the second run, I read the data, modify it, and when I try to overwrite it, an exception is thrown saying that *part-r-<...>.snappy.parquet does not exist*.
The output folder is empty when the exception occurs, but it did contain that file right before `df.write.parquet(path, 'overwrite')` was executed.
I tried setting `spark.sql.parquet.cacheMetadata` to `false`, but that didn't help. `spark.catalog.listTables()` returns an empty list, so there is nothing cached that would need refreshing.
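For reference, this is roughly what I checked (assuming the Spark 2.x `SparkSession` API; the exact config key is my reconstruction):

```python
spark.conf.set("spark.sql.parquet.cacheMetadata", "false")  # didn't help
print(spark.catalog.listTables())  # prints [], so nothing to refresh
```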
Currently, I am just deleting the contents of the output folder and then writing the dataframe. That works. But why does the original overwrite mode fail?
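A sketch of that workaround (the `cache()`/`count()` step is my addition to force evaluation before the old files disappear, and the `_jsc`/`_jvm` accessors are Spark's private Py4J gateways, so this is an unofficial pattern):

```python
df = df.cache()
df.count()  # materialize, so Spark no longer needs the old parquet files

jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
target = jvm.org.apache.hadoop.fs.Path(path)
if fs.exists(target):
    fs.delete(target, True)  # recursively delete the old output

df.write.parquet(path)  # plain write into the now-empty location
```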
Thanks.