How to get a Cartesian product from two PCollections

Question

I am very new to Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2) and ("hello", "world"), their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"), (2, "world")).

Any ideas how I could do this? In addition, since the Cartesian product can be large, I hope the solution can build the product lazily and thus avoid huge memory consumption.

Thanks!

+5

google-cloud-dataflow

Youness Bennani Jan 26 '16 at 7:24

1 answer

Ben Chambers · Accepted Answer · 2016-01-27T01:22:14+0000

In general, computing the Cartesian product is going to be expensive. If either (or both) of the collections fits in memory, you can use side inputs to broadcast the data to all of the workers. So, for your example, you would turn the PCollection<String> into a side input, and then have a ParDo that takes the other PCollection as its main input. For each element of the main input, you can access the side input, which gives you an Iterable<String> of all the values, and output the pairs (or you could choose in this DoFn to output only the pairs that line up).

This iterates over the entire set of words again for every main-input element. If the side input fits in memory, this should be fine. If the side-input data has to be re-fetched every time, it could be problematic.
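To make the side-input pattern concrete, here is a minimal sketch in plain Python. It is only a model of the idea and does not use the Dataflow SDK: the function name `cross_with_side_input` is hypothetical, the list stands in for the side input that must fit in memory, and the generator stands in for the DoFn that emits pairs while streaming over the main input.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def cross_with_side_input(
    main_input: Iterable[A], side_input: Iterable[B]
) -> Iterator[Tuple[A, B]]:
    """Lazily yield every (main, side) pair.

    Hypothetical helper: a plain-Python model of the pattern, not an SDK call.
    """
    # Materialize the side collection once, the way a side input is made
    # available to each worker. This is the part that must fit in memory.
    side_values = list(side_input)
    # Stream over the main input; each pair is produced only when requested.
    for element in main_input:
        for side_value in side_values:
            yield (element, side_value)


# The main input can be any iterable, including a one-shot iterator.
pairs = cross_with_side_input(iter([1, 2]), ["hello", "world"])
first = next(pairs)   # only one pair has been computed so far
rest = list(pairs)    # drain the remaining pairs
```

Only the side collection is held in memory; the product itself is never stored as a whole, which matches the lazy behaviour the question asks for.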

Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produce a PCollection of words keyed by their 3-letter prefixes. You can also create a similar PCollection keyed by the 3-letter suffixes. Then you can GroupByKey (or CoGroupByKey). After that, you have, for each 3-letter key, all the words with that key as a prefix and all the words with that key as a suffix.
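The keyed approach can be sketched the same way. This is again a plain-Python model rather than SDK code: the dictionary plays the role of the shuffle that CoGroupByKey would perform, and the names `co_group_by_overlap` and `overlapping_pairs`, as well as the sample words, are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def co_group_by_overlap(
    words: Iterable[str], n: int = 3
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Map each n-letter key to (words starting with it, words ending with it)."""
    grouped = defaultdict(lambda: ([], []))
    for word in words:
        if len(word) < n:
            continue  # too short to have an n-letter prefix or suffix
        grouped[word[:n]][0].append(word)   # keyed by prefix
        grouped[word[-n:]][1].append(word)  # keyed by suffix
    return dict(grouped)


def overlapping_pairs(words: Iterable[str], n: int = 3) -> Iterator[Tuple[str, str]]:
    """Yield (left, right) where the last n letters of left start right."""
    for with_prefix, with_suffix in co_group_by_overlap(words, n).values():
        # The product is formed only within one key, not across the whole input.
        for left in with_suffix:
            for right in with_prefix:
                yield (left, right)


sample = ["carpet", "petal", "alarm", "talent"]
result = sorted(overlapping_pairs(sample))
```

The point of the design is that each key only pairs up the words that share it, so the full Cartesian product of the dictionary with itself is never built.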
