Not sure why I have a hard time with this; it seems so simple, given that it is pretty easy to do in R or pandas. I wanted to avoid pandas, though, since I deal with a lot of data, and I believe toPandas() loads all of the data into the driver's memory in PySpark.
I have 2 data frames: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I was not sure whether to use filter(), join(), or sql. For example:
df1:
+------+----------+--------------------+
|userid| group    | all_picks          |
+------+----------+--------------------+
|   348|         2|[225, 2235, 2225]   |
|   567|         1|[1110, 1150]        |
|   595|         1|[1150, 1150, 1150]  |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

df2:
+------+----------+---------+
|userid| group    | pick    |
+------+----------+---------+
|   348|         2|     2270|
|   595|         1|     2125|
+------+----------+---------+

Result I want:
+------+----------+--------------------+
|userid| group    | all_picks          |
+------+----------+--------------------+
|   567|         1|[1110, 1150]        |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+
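To pin down exactly what I mean outside of Spark, here is the filter I want expressed in plain Python over the sample rows above: every df1 row whose (userid, group) pair appears in df2 gets dropped. (The list-of-tuples representation is just for illustration, not how the data is actually stored.)

```python
# Sample data from the tables above, as (userid, group, all_picks) rows.
df1_rows = [
    (348, 2, [225, 2235, 2225]),
    (567, 1, [1110, 1150]),
    (595, 1, [1150, 1150, 1150]),
    (580, 2, [2240, 2225]),
    (448, 1, [1130]),
]
# (userid, group, pick) rows.
df2_rows = [
    (348, 2, 2270),
    (595, 1, 2125),
]

# Keys to remove: the (userid, group) pairs present in df2.
df2_keys = {(userid, group) for userid, group, _pick in df2_rows}

# Keep only df1 rows whose (userid, group) is NOT in df2.
result = [row for row in df1_rows if (row[0], row[1]) not in df2_keys]

for row in result:
    print(row)
# (567, 1, [1110, 1150])
# (580, 2, [2240, 2225])
# (448, 1, [1130])
```

In Spark terms I suspect this corresponds to an "anti join" on the two columns, but I am not sure of the right way to express it.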
EDIT: I tried a lot of join() and filter() functions. I believe this is the closest I got:
cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks)  # result has 7 rows
I tried a bunch of different join types, and I also tried different cond values:

cond = ((df1.userid == df2.userid) and (df2.group == df2.group))  # result has 7 rows
cond = ((df1.userid != df2.userid) and (df2.group != df2.group))  # result has 2 rows
However, it seems that the joins add extra rows rather than deleting them.
I am using Python 2.7 and Spark 2.1.0.