The most direct translation of your code:
from pyspark.sql import functions as F
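Putting that together, a minimal sketch of the direct (non-broadcast) version, reusing the orddata and actdataall DataFrames and the ORDValue / 'User ID' columns from the broadcast example below:

# collect the distinct ORDER_IDs to the driver as a plain Python list
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
# keep only the rows whose ORDValue appears in that list, then keep the user column
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids)).select('User ID')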
However, you should only filter like this if the number of distinct ORDER_IDs is definitely small (perhaps fewer than 100,000 or so).
If the number of ORDER_IDs is large, you should use a broadcast variable, which sends the order_ids list to each executor so that the comparison can be done locally for faster processing. Note that this approach also works when the number of ORDER_IDs is small.
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
order_ids_broadcast = sc.broadcast(order_ids)  # send to broadcast variable
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids_broadcast.value)).select('User ID')
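As an optional follow-up (this call is PySpark's standard Broadcast API, not part of the snippet above): once you no longer need the broadcast list, you can free the executors' copies:

order_ids_broadcast.unpersist()  # drop the cached copies on the executors; the data is re-sent if the broadcast is used again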
For more information on broadcast variables, check out: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-broadcast.html