DataFrame.filter, which is an alias for DataFrame.where, expects a SQL expression, given either as a Column:
from pyspark.sql.functions import col
spark_df.filter(col("target").like("good%"))
or as an equivalent SQL string:
spark_df.filter("target LIKE 'good%'")
I believe that you are trying to use RDD.filter, which is a completely different method:
spark_df.rdd.filter(lambda r: r['target'].startswith('good'))
and does not benefit from Spark SQL's query optimizations.
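To make concrete why the LIKE 'good%' pattern and the startswith lambda pick the same rows, here is a minimal pure-Python sketch that needs no Spark session. The helper like_to_regex and the sample rows are hypothetical, written only for illustration. The helper handles just the % and _ wildcards (no escape characters) and is not how Spark evaluates LIKE internally.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple SQL LIKE pattern into a compiled regex.

    Hypothetical helper: supports only the % and _ wildcards.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any sequence of characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    # DOTALL lets the wildcards match newline characters as well
    return re.compile("".join(parts), re.DOTALL)


# Toy stand-in for the rows of spark_df (assumed to have a string "target" column)
rows = [
    {"target": "good"},
    {"target": "goodbye"},
    {"target": "not good"},
    {"target": "bad"},
]

like_good = like_to_regex("good%")

# LIKE must match the whole value, hence fullmatch rather than search
by_like = [r for r in rows if like_good.fullmatch(r["target"])]

# Same predicate as the RDD.filter lambda
by_startswith = [r for r in rows if r["target"].startswith("good")]
```

Both lists contain the same rows here. In Spark the practical difference is where the predicate runs: the Column and SQL string forms are expressions that Spark SQL can analyze, while the lambda is opaque Python code applied row by row.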