New DataFrame column as a generic function of other rows (Spark)

How can I efficiently create a new column in a DataFrame that is a function of other rows in Spark?

This is a Spark implementation of the problem described here:

import pandas as pd
from nltk.metrics.distance import edit_distance as edit_dist
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}

spark_df = sqlCtx.createDataFrame(pd.DataFrame(d))
# Pull every word back to the driver as a list of Row objects
words_list = list(spark_df.select('word').collect())

# For each word, count the other words within edit distance 1
get_n_similar = udf(
    lambda word: len([
        w for w in words_list
        if (w['word'] != word) and (edit_dist(w['word'], word) < 2)
    ]),
    IntegerType()
)

spark_df.withColumn('n_similar', get_n_similar(col('word'))).show()


+---+--------+---------+
|id |word    |n_similar|
+---+--------+---------+
|1  |cat     |1        |
|2  |hat     |2        |
|3  |hag     |2        |
|4  |hog     |2        |
|5  |dog     |1        |
|6  |elephant|0        |
+---+--------+---------+

The problem is that I don't know a way to tell Spark to compare the current row with the other rows in the DataFrame without first collecting the values into a list. Is there a way to apply a generic function of other rows without calling collect?

1 answer

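Since you cannot access a DataFrame from inside a UDF (or another DataFrame from within a udf), the comparison with other rows has to be expressed as a join. In this particular case you can do something like this: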

from pyspark.sql.functions import levenshtein, col

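result = (spark_df.alias("l")
    # Self-join: pair each row with every row whose word is within edit distance 1
    .join(spark_df.alias("r"), levenshtein("l.word", "r.word") < 2)
    # Drop pairs where a word is matched with itself
    .where(col("l.word") != col("r.word"))
    .groupBy("l.id", "l.word")
    .count())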
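
To make it easier to see what the self-join computes, here is a minimal pure-Python sketch of the same pairwise logic on the question's six words. It is not Spark code: `levenshtein_py` is a hypothetical hand-written edit distance standing in for Spark's built-in `levenshtein`, and `rows` is assumed to mirror `spark_df`.

```python
from collections import Counter
from itertools import product


def levenshtein_py(a: str, b: str) -> int:
    """Hypothetical helper: dynamic-programming edit distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Assumed to mirror the (id, word) rows of spark_df from the question.
rows = [(1, "cat"), (2, "hat"), (3, "hag"), (4, "hog"), (5, "dog"), (6, "elephant")]

# Self-join: every (left, right) pair, the two filters, then a count per left row.
n_similar = Counter(
    (l_id, l_word)
    for (l_id, l_word), (_, r_word) in product(rows, rows)
    if l_word != r_word and levenshtein_py(l_word, r_word) < 2
)

for key in sorted(n_similar):
    print(key, n_similar[key])
```

Note that "elephant" does not appear in the output at all: a row with no matching partner produces no pairs, so it disappears from the grouped result.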

- : Apache Spark

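If you want to keep the rows that have no matches, you can skip one of the filters. Every row then also matches itself, so subtract 1 from the count (this assumes each word occurs only once):

(spark_df.alias("l")
    .join(spark_df.alias("r"), levenshtein("l.word", "r.word") < 2)
    .groupBy("l.id", "l.word")
    .count()
    .withColumn("count", col("count") - 1))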

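Alternatively, left-join the original rows against result so that words without matches are kept, and replace the missing counts with 0:

(spark_df
    .select("id", "word")
    .join(result, ["id", "word"], "left")
    .na.fill(0))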
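
As a rough illustration of what the left join contributes, the following pure-Python sketch uses a dictionary as a stand-in for `result`. The values in `result` here are assumed from the expected output in the question, not computed by Spark; in Spark, an unmatched row would come back with a null count, which could then be replaced with 0 via `na.fill(0)`.

```python
# Assumed to mirror the (id, word) rows of spark_df from the question.
rows = [(1, "cat"), (2, "hat"), (3, "hag"), (4, "hog"), (5, "dog"), (6, "elephant")]

# Hypothetical stand-in for `result`: only words with at least one match appear.
result = {(1, "cat"): 1, (2, "hat"): 2, (3, "hag"): 2, (4, "hog"): 2, (5, "dog"): 1}

# Left join on (id, word): every original row is kept, and a missing key
# (the analogue of a null count) is replaced by 0.
joined = [(row_id, word, result.get((row_id, word), 0)) for row_id, word in rows]

for row in joined:
    print(row)
```

Every original row survives the join, including "elephant", which gets a count of 0.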

