Get identifiers for duplicate rows (considering all other columns) in Apache Spark

I have a Spark SQL DataFrame consisting of an ID column and n "data" columns, i.e.

id | dat1 | dat2 | ... | datn

The column ID is unique, while the columns dat1 ... datn may contain duplicates.

My goal is to find the IDs of these duplicate rows.

My approach so far:

  • get duplicate rows using groupBy:

    dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')

  • join dup_df back to the full df to get the duplicate rows, including their ID:

    df.join(dup_df, df.columns[1:])

I am pretty sure that this is basically correct, but it fails because the columns dat1 ... datn contain null values.

Joins do not match rows on null values, as explained e.g. in other answers here on SO.

So my questions are:

  • Is there a way to make joins treat null values as equal (a rough sketch of one idea follows this list)?
  • Or is there another (easier, cleaner, ...) way to get the desired IDs?
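
For illustration, here is a rough, untested sketch of the kind of null-safe join I have in mind. It uses Column.eqNullSafe, which as far as I know only exists from Spark 2.3 on; with 2.1.0 one would presumably have to fall back to the SQL <=> operator via expr.

    from functools import reduce
    from pyspark.sql import functions as F

    a = df.alias("a")
    b = dup_df.alias("b")

    # Null-safe equality: <=> treats two nulls as equal, unlike plain ==.
    # Column.eqNullSafe requires Spark 2.3+; on 2.1 something like
    # F.expr("a.dat1 <=> b.dat1") per column would be needed instead.
    cond = reduce(
        lambda x, y: x & y,
        [F.col("a." + c).eqNullSafe(F.col("b." + c)) for c in df.columns[1:]]
    )

    dup_rows = a.join(b, cond).select("a.*")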

BTW: I am using Spark 2.1.0 with Python 3.5.3.


To get the ids of the duplicate rows, you can group by the data columns and collect the ids with collect_list:

from pyspark.sql.functions import collect_list, size

Example data (note that toDF(["id"]) names only the first column; the remaining columns keep the inferred names _2, _3, _4):

df = sc.parallelize([
    (1, "a", "b", 3),
    (2, None, "f", None),
    (3, "g", "h", 4),
    (4, None, "f", None),
    (5, "a", "b", 3)
]).toDF(["id"])

The query:

(df
   .groupBy(df.columns[1:])
   .agg(collect_list("id").alias("ids"))
   .where(size("ids") > 1))

The result:

+----+---+----+------+
|  _2| _3|  _4|   ids|
+----+---+----+------+
|null|  f|null|[2, 4]|
|   a|  b|   3|[1, 5]|
+----+---+----+------+

You can then explode the ids column (or use a udf) to get one id per row, without needing a join.
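
For example, a minimal sketch building on the aggregation above, with the same example data:

from pyspark.sql.functions import collect_list, size, explode

dup_ids = (df
    .groupBy(df.columns[1:])
    .agg(collect_list("id").alias("ids"))
    .where(size("ids") > 1)
    # flatten the collected lists back to one duplicate id per row
    .select(explode("ids").alias("id")))

dup_ids.show()
# +---+
# | id|
# +---+
# |  2|
# |  4|
# |  1|
# |  5|
# +---+
# (row order may vary)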

Alternatively, you can identify each duplicate group by the minimal id per group, using window functions. A few additional imports:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min

The window definition:

w = Window.partitionBy(df.columns[1:])

The query:

(df
    .select(
        "*", 
        count("*").over(w).alias("_cnt"), 
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))

The result:

+---+----+---+----+----+-----+
| id|  _2| _3|  _4|_cnt|group|
+---+----+---+----+----+-----+
|  2|null|  f|null|   2|    2|
|  4|null|  f|null|   2|    2|
|  1|   a|  b|   3|   2|    1|
|  5|   a|  b|   3|   2|    1|
+---+----+---+----+----+-----+

You can then use the group column for self-joins, aggregations or any further processing, as sketched below.
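
For example, a short sketch (reusing the window w and the imports above) that collects the ids of each duplicate group, keyed by group:

from pyspark.sql.functions import collect_list

grouped = (df
    .select(
        "*",
        count("*").over(w).alias("_cnt"),
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1)
    # one row per duplicate group, listing all ids that belong to it
    .groupBy("group")
    .agg(collect_list("id").alias("ids")))

grouped.show()
# +-----+------+
# |group|   ids|
# +-----+------+
# |    2|[2, 4]|
# |    1|[1, 5]|
# +-----+------+
# (row and list order may vary)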


Source: https://habr.com/ru/post/1673512/
