One of our Dataflow jobs writes its output to BigQuery. My understanding of how this is implemented under the hood is that Dataflow actually writes the results (sharded) as JSON files to GCS, and then kicks off a BigQuery load job to import that data.
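For reference, the write step in our pipeline looks roughly like the sketch below (the project, dataset, table, schema, and rows here are placeholders for illustration, not our actual job):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.TableRowJsonCoder;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;

import java.util.Arrays;

public class WriteToBigQueryExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder schema and row; the real job produces TableRows from upstream transforms.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("name").setType("STRING"),
        new TableFieldSchema().setName("count").setType("INTEGER")));

    p.apply(Create.of(new TableRow().set("name", "example").set("count", 1))
            .withCoder(TableRowJsonCoder.of()))
     .apply(BigQueryIO.Write
            .named("WriteToBigQuery")
            .to("my-project:my_dataset.my_table")  // placeholder table reference
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```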
However, we noticed that some JSON files are not deleted after the job, regardless of whether it succeeds or fails. There is no warning or note in the job output suggesting that the files will not be deleted. When we noticed this, we looked at our bucket and found hundreds of large JSON files left over from failed jobs (mostly during development).
I would think that Dataflow should handle cleanup even if the job fails, and when it succeeds those files should certainly be deleted. Leaving these files around after the job has finished incurs significant storage costs!
Is this a bug?
Here is an example job ID that "succeeded" but left hundreds of large files in GCS: 2015-05-27_18_21_21-8377993823053896089
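In the meantime we clear out the leftovers by hand with something along these lines (a rough sketch; the bucket name, prefix, and client library usage are assumptions, adjust to wherever your job stages its files):

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class TempFileCleanup {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // Placeholder bucket and staging prefix where the leftover JSON shards accumulate.
    String bucket = "my-dataflow-staging-bucket";
    String prefix = "tmp/";

    for (Blob blob : storage.list(bucket, Storage.BlobListOption.prefix(prefix)).iterateAll()) {
      if (blob.getName().endsWith(".json")) {
        blob.delete();
        System.out.println("Deleted " + blob.getName());
      }
    }
  }
}
```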