I have the following commands in Spark:
data = sqlContext.sql("select column1, column2, column3 from table_name")
words = sc.textFile("words.txt")
words.txt has a bunch of words (one per line), and data has rows with the three columns taken from table_name.
Now I want to filter out the rows in data (a Spark DataFrame) whenever any word from words.txt occurs as a substring in any of the three columns of data.
For example, if words.txt has a word, say gon, and any of the three columns of data contains a value like bygone, gone, etc., I want to filter out that row.
I tried the following:
data.filter(~(data['column1'].like('%gon%') | data['column2'].like('%gon%') | data['column3'].like('%gon%'))).toPandas()
It works for one word. But I want to check all the words from words.txt and drop every row that matches any of them. Is there any way to do this?
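For context, here is the kind of thing I am imagining (just a rough sketch, assuming words.txt is small enough to collect to the driver; the names word_list, cols, and contains_any are placeholders I made up):

from functools import reduce
from pyspark.sql import functions as F

# collect the words to the driver (assumes words.txt is small)
word_list = words.collect()

cols = ['column1', 'column2', 'column3']

# build one condition that is True when any word appears in any of the three columns
contains_any = reduce(
    lambda a, b: a | b,
    [F.col(c).like('%' + w + '%') for w in word_list for c in cols]
)

# keep only the rows where no word matches
filtered = data.filter(~contains_any)

But I do not know if this is the idiomatic approach, or whether it will be efficient when words.txt is large.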
I am new to PySpark. Any suggestions would be helpful.