Filter a df when values match part of a string in PySpark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL stored in the location column contains a predefined string, e.g. 'google.com'.

I tried df.filter(sf.col('location').contains('google.com')) but it throws

TypeError: 'Column' object is not callable

How do I work around this and filter my df correctly? Thanks in advance!

+28
3 answers

Spark 2.2 onwards

df.filter(df.location.contains('google.com'))

Spark 2.2 documentation link
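
For context, a minimal self-contained sketch of that approach (the location column name comes from the question; the sample rows are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('https://www.google.com/search',), ('https://example.org/',)],
    ['location'],
)

# contains() does a plain substring match, so only the first row survives
df.filter(df.location.contains('google.com')).show()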


Spark 2.1 and earlier

You can use plain SQL in filter

df.filter("location like '%google.com%'")

or with DataFrame column methods

df.filter(df.location.like('%google.com%'))
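
If you need a regular expression rather than SQL % wildcards, Column.rlike is the analogous method (a small sketch against the same assumed location column; the pattern escapes the dot so it matches literally):

df.filter(df.location.rlike('google\\.com'))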

Spark 2.1 documentation link

+55

pyspark.sql.Column.contains() is only available in PySpark 2.2 and above.

df.where(df.location.contains('google.com'))
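
Since where is an alias for filter, this is equivalent to the accepted answer. The resulting Column also composes with boolean operators, e.g. ~ to keep rows that do not contain the substring (same assumed column):

df.where(~df.location.contains('google.com'))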
+12

When filtering a DataFrame on string values, I found that the pyspark.sql.functions lower and upper come in handy if your data can have column entries such as "foo" and "Foo":

import pyspark.sql.functions as sql_fun

# lower() normalizes case before the substring check, so 'Foo' also matches
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
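
As a quick check against invented sample data (reusing the spark session from the earlier sketch), the lowering makes the match case-insensitive:

df = spark.createDataFrame([('Foo',), ('FOO',), ('bar',)], ['col_name'])

# matches 'Foo' and 'FOO' but not 'bar'
df.filter(sql_fun.lower(sql_fun.col('col_name')).contains('foo')).show()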
+1

Source: https://habr.com/ru/post/1014486/

