Not sure why I have a hard time with this; it seems so simple, given that it is pretty easy to do in R or pandas. I wanted to avoid pandas, though, since I deal with a lot of data, and I believe toPandas() loads all of the data into the driver's memory in PySpark.
I have 2 data frames: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I was not sure whether to use filter(), join(), or sql. For example:
df1:
+------+----------+--------------------+
|userid| group    | all_picks          |
+------+----------+--------------------+
|   348|         2|[225, 2235, 2225]   |
|   567|         1|[1110, 1150]        |
|   595|         1|[1150, 1150, 1150]  |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

df2:
+------+----------+---------+
|userid| group    | pick    |
+------+----------+---------+
|   348|         2|     2270|
|   595|         1|     2125|
+------+----------+---------+

Result I want:
+------+----------+--------------------+
|userid| group    | all_picks          |
+------+----------+--------------------+
|   567|         1|[1110, 1150]        |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+
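To pin down exactly what I mean outside of Spark, here is the filter I want expressed in plain Python over the sample rows above: every df1 row whose (userid, group) pair appears in df2 gets dropped. (The list-of-tuples representation is just for illustration, not how the data is actually stored.)

```python
# Sample data from the tables above, as (userid, group, all_picks) rows.
df1_rows = [
    (348, 2, [225, 2235, 2225]),
    (567, 1, [1110, 1150]),
    (595, 1, [1150, 1150, 1150]),
    (580, 2, [2240, 2225]),
    (448, 1, [1130]),
]
# (userid, group, pick) rows.
df2_rows = [
    (348, 2, 2270),
    (595, 1, 2125),
]

# Keys to remove: the (userid, group) pairs present in df2.
df2_keys = {(userid, group) for userid, group, _pick in df2_rows}

# Keep only df1 rows whose (userid, group) is NOT in df2.
result = [row for row in df1_rows if (row[0], row[1]) not in df2_keys]

for row in result:
    print(row)
# (567, 1, [1110, 1150])
# (580, 2, [2240, 2225])
# (448, 1, [1130])
```

In Spark terms I suspect this corresponds to an "anti join" on the two columns, but I am not sure of the right way to express it.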
EDIT: I tried a lot of join() and filter() functions. I believe this is the closest I got:
cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks)  # result has 7 rows
I tried a bunch of different join types, and I also tried different cond values:

cond = ((df1.userid == df2.userid) and (df2.group == df2.group))  # result has 7 rows
cond = ((df1.userid != df2.userid) and (df2.group != df2.group))  # result has 2 rows
However, it seems that the joins add extra rows rather than deleting them.
I am using Python 2.7 and Spark 2.1.0.