Considering
import pyspark.sql.functions as psf
There are two types of broadcasts:
- sc.broadcast() to copy Python objects to each node for more efficient use of psf.isin
- psf.broadcast inside a join to copy your pyspark dataframe to each node when the dataframe is small: df1.join(psf.broadcast(df2)). It is commonly used for Cartesian products (CROSS JOIN in Pig).
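As a minimal, self-contained sketch of both kinds (the dataframes, column names, and values below are made up for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
df2 = spark.createDataFrame([(1,), (3,)], ["id"])

# 1) sc.broadcast(): ship a plain Python list to every executor,
#    then reuse it cheaply with psf.isin
ids = sc.broadcast([1, 3])
df1.filter(psf.col("id").isin(ids.value)).show()

# 2) psf.broadcast(): hint Spark to replicate the small dataframe
#    to every node so the join needs no shuffle of df1
df1.join(psf.broadcast(df2), "id").show()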
In the context of the question, the filtering was done using a column of another dataframe, hence the proposed solution with a join.
Keep in mind that if your filter list is relatively large, the search operation on it will take some time, and since this is done for each row, it can quickly become expensive.
Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough, you may not need to sort the huge dataframe just for the filter.
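Continuing the sketch above, here is a hedged comparison of the two filtering strategies; with a leftsemi broadcast join, Spark replicates the small dataframe to each node instead of sorting or shuffling the large one:

# isin: the broadcast list is scanned once per row of df1
by_isin = df1.filter(psf.col("id").isin(ids.value))

# leftsemi join with a broadcast hint: keeps the rows of df1 whose id
# appears in df2, without shuffling or sorting df1
by_join = df1.join(psf.broadcast(df2), "id", "leftsemi")

by_isin.show()
by_join.show()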