Writing to BigQuery from Dataflow - JSON files are not deleted when the job completes

One of our Dataflow jobs writes its output to BigQuery. My understanding of how this is implemented under the hood is that Dataflow actually writes the results, sharded into JSON files, to GCS, and then kicks off a BigQuery load job to import that data.
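For what it's worth, here is one way to observe that load stage from the outside (a sketch, assuming the bq command-line tool is installed; the project and job IDs are placeholders, not values from our setup):

# List recent BigQuery jobs in the project; the load job that Dataflow
# starts on our behalf should show up here:
bq ls -j -n 20 --project_id=<project_id>

# Inspect a single load job; its configuration lists the temporary
# GCS JSON files as source URIs:
bq show -j <load_job_id>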

However, we noticed that some JSON files are not deleted once the job finishes, regardless of whether it succeeded or failed. There are no warnings or hints in the logs that the files will be left behind. When we noticed this, we had a look at our bucket, and it contained hundreds of large JSON files left over from failed jobs (mostly during development).
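To gauge how much space these leftovers occupy (a sketch; the bucket name is a placeholder):

# Total bucket size, human-readable summary:
gsutil du -sh gs://<bucket>

# List leftover JSON files with their sizes, recursively:
gsutil ls -l gs://<bucket>/**.json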

I would have thought that Dataflow should handle cleanup even if the job fails, and that these files should certainly be deleted when it succeeds. Leaving these files around after the job completes incurs significant storage costs!

Is this a bug?

An example job ID for a job that "succeeded" but left hundreds of large files in GCS: 2015-05-27_18_21_21-8377993823053896089


3 answers

Since this is still happening, we decided to clean up ourselves after the pipeline has finished. We run the following command to delete everything that is not a JAR or ZIP:

gsutil ls -p <project_id> gs://<bucket> | grep -vE '\.(zip|jar)$' | xargs -n 1 gsutil -m rm -r
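As an alternative sketch (not part of the original answer): a GCS lifecycle rule can expire stale objects automatically. Note that the rule applies to the whole bucket, so this only makes sense for a bucket dedicated to temporary files; the 3-day age below is an arbitrary example, and the bucket name is a placeholder.

# Write a lifecycle policy that deletes objects older than 3 days:
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 3}
    }
  ]
}
EOF

# Apply the policy to the temp bucket:
gsutil lifecycle set lifecycle.json gs://<temp-bucket>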

Another possible cause of leftover files is cancellation. Currently, Dataflow does not delete files from cancelled jobs. In other cases, the files should be cleaned up.

Also, the "Unable to delete temporary files" errors mentioned in the question are the result of a logging issue on our side and should be fixed within a week or two. Until then, feel free to ignore these errors, as they do not indicate leftover files.


This was a bug where the Dataflow service sometimes failed to delete the temporary JSON files after the BigQuery import job completed. We have fixed the issue internally and rolled out a release with the fix.

