Filter rows in a Spark DataFrame based on words in an RDD

I have the following commands in Spark:

data = sqlContext.sql("select column1, column2, column3 from table_name")

words = sc.textFile("words.txt")

words.txt has a bunch of words, and data has rows with three columns taken from table_name.

Now I want to filter out the rows in data (a Spark DataFrame) whenever any word from words.txt occurs, even as part of a larger word, in any of the three columns of data.

For example, if words.txt contains a word such as gon, and any of the three columns of data contains values like bygone, gone, etc., I want to filter out that row.

I tried the following:

data.filter(~data['column1'].like('%gon%') | data['column2'].like('%gon%') | data['column3'].like('%gon%')).toPandas()

It works for one word. But I want to check all the words from words.txt and delete the matching rows. Is there any way to do this?

I am new to PySpark. Any suggestions would be helpful.


Build a single regex pattern from the words in words.txt, for example:

(?s)^(?=.*word1)(?=.*word2)(?=.*word3)

Here, (?s) enables DOTALL mode so that . also matches newlines, ^ anchors the pattern at the start of the string, and each (?=...) is a positive lookahead that requires the corresponding word to appear somewhere in the value.
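
A minimal sketch of how such a pattern could be assembled from words.txt (this assumes the word list fits in driver memory and the words contain no regex metacharacters that would need escaping):

# words is the RDD from sc.textFile("words.txt"); collect() pulls it to the driver
word_list = [w.strip() for w in words.collect() if w.strip()]
rx = "(?s)^" + "".join("(?=.*{})".format(w) for w in word_list)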

Then, with the pattern stored in a variable rx, apply:

data.filter(~(data['column1'].rlike(rx) | data['column2'].rlike(rx) | data['column3'].rlike(rx))).toPandas()

Note that regex patterns require rlike rather than like, which only supports the SQL wildcards % and _.
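
As a small illustration of the difference (using gon, the example word from the question), both of these keep rows whose column1 contains gon, but the patterns belong to different languages:

data.filter(data['column1'].like('%gon%'))            # SQL wildcard pattern, no regex
data.filter(data['column1'].rlike('(?s)^(?=.*gon)'))  # regular expression with a lookahead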


Source: https://habr.com/ru/post/1651958/

