Pyspark: isin vs join

What are the general guidelines for filtering a data frame in PySpark against a given list of values? In particular:

Depending on the size of this list of values, when is it better, in terms of runtime, to use isin vs. an inner join vs. a broadcast join?

This question is the Spark counterpart of the following Pig question:

Pig: effective filtering of the loaded list

Additional context:

Pyspark isin function

1 answer

Considering

    import pyspark.sql.functions as psf

There are two types of broadcasts:

  • sc.broadcast() copies a plain Python object to every node, for more efficient use with psf.isin
  • psf.broadcast inside a join copies your PySpark data frame to every node when that data frame is small: df1.join(psf.broadcast(df2)). It is commonly used for Cartesian products (CROSS JOIN in Pig).

In the context of the question, the filtering was done using a column of another data frame, hence the suggested solution with a join.

Keep in mind that if your filter list is relatively large, the membership test on it will take some time, and since it is performed for every row, the cost can quickly add up.

Joins, on the other hand, involve two data frames that will be sorted before matching (a sort-merge join), so if your list is small enough, you may not want to sort a huge data frame just for a filter.


Source: https://habr.com/ru/post/1271066/
