In general, computing the Cartesian product is expensive. If either (or both) of the collections fits in memory, you can use side inputs to broadcast the data to all workers. So, for your example, you would turn the PCollection<String> into a side input, and then have a ParDo that takes the same collection as its main input. For each string in the main input, you can access the side input, which exposes an Iterable<String> of all the values, and output the pairs (or you could choose to output only the pairs that line up, directly in this DoFn).
This iterates over the entire set of words for every element. If the set fits in memory, this should be fine. If the side input has to be re-fetched every time, it could be problematic.
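To make the per-element logic concrete, here is a minimal plain-Java sketch of what the DoFn body would do. It deliberately does not use the Dataflow/Beam API: the class name `SideInputPairsSketch`, the methods `processElement` and `pairs`, and the `keep` predicate are all hypothetical names chosen for illustration, and an ordinary in-memory `List` stands in for the side input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Plain-Java stand-in for the side-input approach (not the Beam API).
class SideInputPairsSketch {

    // Plays the role of the DoFn body: invoked once per main-input element,
    // with the whole "side input" available as an in-memory Iterable.
    static void processElement(String element,
                               Iterable<String> sideInput,
                               BiPredicate<String, String> keep,
                               List<Map.Entry<String, String>> out) {
        for (String other : sideInput) {
            // Emit every pair, or only the pairs the predicate accepts.
            if (keep.test(element, other)) {
                out.add(Map.entry(element, other));
            }
        }
    }

    // Plays the role of the ParDo: applies processElement to each main-input element.
    static List<Map.Entry<String, String>> pairs(List<String> mainInput,
                                                 List<String> sideInput,
                                                 BiPredicate<String, String> keep) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String element : mainInput) {
            processElement(element, sideInput, keep, out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = List.of("ab", "cd");
        // Full Cartesian product of the collection with itself.
        System.out.println(pairs(words, words, (a, b) -> true));
    }
}
```

The inner loop runs once per main-input element, which is why the cost grows with the product of the two collection sizes and why the side collection needs to be cheap to re-read.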
Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and create a PCollection of values keyed by their 3-letter prefixes. You can also create a similar PCollection keyed by the 3-letter suffixes. Then you can GroupByKey (or CoGroupByKey). After that, you have, for every 3-letter key, all the words with that as a prefix and all the words with that as a suffix.
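The keyed approach can be sketched in plain Java as well, with in-memory maps standing in for the shuffle. This is an illustration under assumptions, not Dataflow/Beam code: `OverlapJoinSketch`, `groupByPrefix`, `groupBySuffix`, and `overlappingPairs` are hypothetical names, and words shorter than three letters are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the key-and-group approach (not the Beam API).
class OverlapJoinSketch {
    static final int OVERLAP = 3;

    // Key each word by its first 3 letters and group words sharing a key.
    static Map<String, List<String>> groupByPrefix(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String w : words) {
            if (w.length() >= OVERLAP) {
                groups.computeIfAbsent(w.substring(0, OVERLAP), k -> new ArrayList<>()).add(w);
            }
        }
        return groups;
    }

    // Key each word by its last 3 letters and group words sharing a key.
    static Map<String, List<String>> groupBySuffix(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String w : words) {
            if (w.length() >= OVERLAP) {
                groups.computeIfAbsent(w.substring(w.length() - OVERLAP), k -> new ArrayList<>()).add(w);
            }
        }
        return groups;
    }

    // Join the two groupings on the key: for each 3-letter key, pair every word
    // ending with it (left) with every word starting with it (right).
    static List<Map.Entry<String, String>> overlappingPairs(List<String> words) {
        Map<String, List<String>> byPrefix = groupByPrefix(words);
        Map<String, List<String>> bySuffix = groupBySuffix(words);
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : bySuffix.entrySet()) {
            List<String> starts = byPrefix.get(e.getKey());
            if (starts == null) {
                continue; // no word starts with this key
            }
            for (String left : e.getValue()) {
                for (String right : starts) {
                    out.add(Map.entry(left, right));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(overlappingPairs(List.of("carpet", "petal", "talon", "dogs")));
    }
}
```

Only words that share a key are ever compared, so the work is bounded by the size of each key's groups rather than by the full product of the dictionary with itself.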