Writing to BigQuery from Dataflow - JSON files are not deleted when the job completes

One of our Dataflow jobs writes its output to BigQuery. My understanding of how this is implemented under the hood is that Dataflow actually writes the results, sharded into JSON files, to GCS, and then kicks off a BigQuery load job to import that data.
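For what it's worth, here is one way to observe that load stage from the outside (a sketch, assuming the bq command-line tool is installed; the project and job IDs are placeholders, not values from our setup):

# List recent BigQuery jobs in the project; the load job that Dataflow
# starts on our behalf should show up here:
bq ls -j -n 20 --project_id=<project_id>

# Inspect a single load job; its configuration lists the temporary
# GCS JSON files as source URIs:
bq show -j <load_job_id>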

However, we noticed that some JSON files are not deleted once the job finishes, regardless of whether it succeeded or failed. There are no warnings or hints in the logs that the files will be left behind. When we noticed this, we had a look at our bucket, and it contained hundreds of large JSON files left over from failed jobs (mostly during development).
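To gauge how much space these leftovers occupy (a sketch; the bucket name is a placeholder):

# Total bucket size, human-readable summary:
gsutil du -sh gs://<bucket>

# List leftover JSON files with their sizes, recursively:
gsutil ls -l gs://<bucket>/**.json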

I would have thought that Dataflow should handle cleanup even if the job fails, and that these files should certainly be deleted when it succeeds. Leaving these files around after the job completes incurs significant storage costs!

Is this a bug?

An example job ID for a job that "succeeded" but left hundreds of large files in GCS: 2015-05-27_18_21_21-8377993823053896089


3 answers

Since this is still happening, we decided to clean up ourselves after the pipeline has finished. We run the following command to delete everything that is not a JAR or ZIP:

gsutil ls -p <project_id> gs://<bucket> | grep -vE '\.(zip|jar)$' | xargs -n 1 gsutil -m rm -r
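As an alternative sketch (not part of the original answer): a GCS lifecycle rule can expire stale objects automatically. Note that the rule applies to the whole bucket, so this only makes sense for a bucket dedicated to temporary files; the 3-day age below is an arbitrary example, and the bucket name is a placeholder.

# Write a lifecycle policy that deletes objects older than 3 days:
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 3}
    }
  ]
}
EOF

# Apply the policy to the temp bucket:
gsutil lifecycle set lifecycle.json gs://<temp-bucket>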

Another possible cause of leftover files is cancellation. Currently, Dataflow does not delete files from cancelled jobs. In other cases, the files should be cleaned up.

Also, the "Unable to delete temporary files" errors mentioned in the question are the result of a logging issue on our side and should be fixed within a week or two. Until then, feel free to ignore these errors, as they do not indicate leftover files.


This was a bug where the Dataflow service sometimes failed to delete the temporary JSON files after the BigQuery import job completed. We have fixed the issue internally and rolled out a release with the fix.

