apache_beam.transforms.util.Reshuffle() is not available for GCP Dataflow

I upgraded to the latest apache_beam[gcp] package with pip install --upgrade apache_beam[gcp]. However, I noticed that Reshuffle() is not included in the [gcp] distribution. Does this mean I cannot use Reshuffle() in any Dataflow pipeline? Is there a way around this? Or is the pip package simply not up to date, so that if Reshuffle() is in master on GitHub it will be available in Dataflow?

Following the answer to this question, I am trying to read data from BigQuery and randomize it before writing it to CSV files in a GCP storage bucket. I noticed that the .csv shards I use to train my GCMLE model are not actually random. Within TensorFlow I can shuffle the batches, but that only randomizes the rows within each file as it is read into the queue, and my problem is that the files themselves are biased in some way. Any suggestions for other ways to shuffle the data right before writing the CSV files in Dataflow would be much appreciated.

1 answer

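As a workaround, you can shuffle the PCollection yourself by attaching random keys, grouping by key, and then dropping the keys: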

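import random

from apache_beam import FlatMap, GroupByKey, Map

# Key each element randomly, group by key, then drop the keys again.
shuffled_data = (unshuffled_pcoll
        | 'AddRandomKeys' >> Map(lambda t: (random.getrandbits(32), t))
        | 'GroupByKey' >> GroupByKey()
        # Each grouped item is (key, iterable of elements); emit only the elements.
        | 'RemoveRandomKeys' >> FlatMap(lambda t: t[1]))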

, ExpandIterable code
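To make the three steps easier to reason about, here is a plain-Python sketch of the same idea that runs without Beam. It is only an illustration: the function name shuffle_with_random_keys and the seed parameter are hypothetical, and iterating the groups in key order is an assumption that stands in for the arbitrary order in which a runner emits grouped results.

```python
import random
from collections import defaultdict


def shuffle_with_random_keys(rows, seed=None):
    """Plain-Python imitation of AddRandomKeys -> GroupByKey -> RemoveRandomKeys."""
    rng = random.Random(seed)

    # AddRandomKeys: pair every row with a random 32-bit key.
    keyed = [(rng.getrandbits(32), row) for row in rows]

    # GroupByKey: collect rows that share a key (collisions are rare but possible).
    groups = defaultdict(list)
    for key, row in keyed:
        groups[key].append(row)

    # Assumption: walk the groups in key order to stand in for the arbitrary
    # order in which a distributed runner would emit them.
    shuffled = []
    for key in sorted(groups):
        # RemoveRandomKeys: drop the key and flatten the group back into rows.
        shuffled.extend(groups[key])
    return shuffled


rows = ["row-%d" % i for i in range(10)]
shuffled = shuffle_with_random_keys(rows, seed=0)
print(shuffled)
```

Every input row appears exactly once in the output; only the order changes. In a real pipeline the grouping happens across workers, so the order in which rows reach the CSV writer is decided by the runner rather than by a sort.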


Source: https://habr.com/ru/post/1693083/

