Google Cloud DataFlow Randomize WritetoBigQuery

Question


I have successfully built a Dataflow pipeline that writes to BigQuery. The pipeline transforms data for use with Cloud ML Engine. However, I noticed that the rows that were written are ordered (or at least grouped) by the labels of my data. By this I mean that they visibly appear to be organized in some way rather than being completely random. Then, when I export the table to sharded .csv files in GCS, each sharded .csv is essentially ordered. This means the data cannot be fed into TensorFlow randomly, since TF reads one .csv at a time and the .csv files themselves are not random batches of rows.

Can someone explain why the BigQuery table written by the Apache Beam pipeline would not be random if the original input was randomized? Is there a way to force a shuffle / randomize the rows before writing to BigQuery? I need to make sure that the training data is completely random before loading it into the ML model.


google-cloud-platform google-bigquery google-cloud-dataflow

reese0106 Oct 16 '17 at 20:46


1 answer

jkff · Accepted Answer · 2017-10-16T21:12:05+0000

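BigQuery tables are not ordered; order only comes into play when you query the table with ORDER BY or GROUP BY. So the randomization is best done in BigQuery itself, when you query or sample the data, for example as described in https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning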
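
As a rough illustration of sampling and shuffling on the BigQuery side rather than in the pipeline, the sketch below builds a Standard SQL query string in Python. It is only one possible approach, not necessarily the method the linked article uses. The table name `my-project.my_dataset.training_rows`, the key column `row_id`, and the helper `build_sampling_query` are hypothetical names introduced for this example. The query relies on the BigQuery functions `FARM_FINGERPRINT`, `MOD`, `ABS` and `RAND`; hashing a key column gives a split that is the same on every run, while `ORDER BY RAND()` asks for the selected rows in random order and may be expensive on large tables.

```python
def build_sampling_query(table, key_column, train_percent=80, shuffle=True):
    """Return a BigQuery Standard SQL string for a repeatable training sample.

    `table` and `key_column` are placeholders (assumptions for this sketch);
    substitute the real table and a column whose values identify a row.
    """
    if not 0 < train_percent <= 100:
        raise ValueError("train_percent must be in (0, 100]")

    # Hash the key into one of 100 buckets; the same key always lands in the
    # same bucket, so the sample is repeatable across runs.
    query = (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f"WHERE MOD(ABS(FARM_FINGERPRINT(CAST({key_column} AS STRING))), 100)"
        f" < {train_percent}"
    )

    if shuffle:
        # Request the selected rows in random order for this query's result.
        query += "\nORDER BY RAND()"
    return query


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    print(build_sampling_query("my-project.my_dataset.training_rows", "row_id"))
```

The resulting string could then be run with whatever BigQuery client is already in use and the result exported to GCS. Shuffling again at training time is still worthwhile, since the order of rows across exported shards is not something this query controls.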
