I have successfully built a Dataflow pipeline that writes to BigQuery. The pipeline transforms data into the form Cloud ML Engine needs. However, I noticed that the rows it writes are ordered (or at least grouped) by the labels of my data. By this I mean that, looking at the table, the rows are visibly organized in some way rather than being completely random. Then, when I export the table to sharded .csv files in GCS, each exported .csv is essentially ordered as well. This means the data cannot be fed into TensorFlow in random order, since TF reads one .csv at a time and the individual .csv files are not random bags of rows.
Can someone explain why the BigQuery table written by an Apache Beam pipeline would not be random if the original input was randomized? Is there a way to force a shuffle / randomize the rows before writing to BigQuery? I need to make sure that the training data is completely random before loading it into the ML model.
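To make clear what I mean by "random", here is a minimal sketch of the fallback I am considering: pooling the rows of all exported shards, shuffling them, and splitting them again. It is only an illustration under assumptions of mine: the data fits in memory, the .csv files have no header row, and `shuffle_across_shards` is a hypothetical helper name, not part of Beam, BigQuery, or TensorFlow. I would prefer a way to do this inside the pipeline itself.

```python
import csv
import io
import random


def shuffle_across_shards(shards, seed=None):
    """Pool the rows of all CSV shards, shuffle them, and re-split them.

    `shards` is a list of CSV texts, one string per exported file
    (assumed to have no header row). Returns a list of the same length
    with the rows redistributed in random order.
    """
    # Collect every row from every shard into a single list.
    rows = []
    for text in shards:
        rows.extend(csv.reader(io.StringIO(text)))

    # Shuffle globally so label groups are broken up across all shards.
    random.Random(seed).shuffle(rows)

    # Deal the rows round-robin so shard sizes differ by at most one.
    n = len(shards)
    shuffled = []
    for i in range(n):
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows[i::n])
        shuffled.append(buf.getvalue())
    return shuffled
```

With this, two shards that each contain a single label would come back as two shards that mix both labels, which is the property I want the exported files to have.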