Filter a df when values match part of a string in PySpark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL stored in the location column contains a predefined string, e.g. 'google.com'.

I tried df.filter(sf.col('location').contains('google.com')) but it throws

TypeError: 'Column' object is not callable

How do I work around this and filter my df correctly? Thanks in advance!

+28
3 answers

Spark 2.2 onwards

df.filter(df.location.contains('google.com'))

Spark 2.2 documentation link
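
For context, a minimal self-contained sketch of that approach (the location column name comes from the question; the sample rows are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('https://www.google.com/search',), ('https://example.org/',)],
    ['location'],
)

# contains() does a plain substring match, so only the first row survives
df.filter(df.location.contains('google.com')).show()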


Spark 2.1 and earlier

You can use plain SQL in filter

df.filter("location like '%google.com%'")

or with DataFrame column methods

df.filter(df.location.like('%google.com%'))
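
If you need a regular expression rather than SQL % wildcards, Column.rlike is the analogous method (a small sketch against the same assumed location column; the pattern escapes the dot so it matches literally):

df.filter(df.location.rlike('google\\.com'))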

Spark 2.1 documentation link

+55

pyspark.sql.Column.contains() is only available in PySpark 2.2 and above.

df.where(df.location.contains('google.com'))
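
Since where is an alias for filter, this is equivalent to the accepted answer. The resulting Column also composes with boolean operators, e.g. ~ to keep rows that do not contain the substring (same assumed column):

df.where(~df.location.contains('google.com'))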
+12

When filtering a DataFrame on string values, I found that the pyspark.sql.functions lower and upper come in handy if your data can have column entries such as "foo" and "Foo":

import pyspark.sql.functions as sql_fun

# lower() normalizes case before the substring check, so 'Foo' also matches
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
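
As a quick check against invented sample data (reusing the spark session from the earlier sketch), the lowering makes the match case-insensitive:

df = spark.createDataFrame([('Foo',), ('FOO',), ('bar',)], ['col_name'])

# matches 'Foo' and 'FOO' but not 'bar'
df.filter(sql_fun.lower(sql_fun.col('col_name')).contains('foo')).show()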
+1

Source: https://habr.com/ru/post/1014486/

