Spark SQL case-insensitive filter for column conditions

How do I make a Spark SQL filter case-insensitive?

For instance:

dataFrame.filter(dataFrame.col("vendor").equalTo("fortinet")); 

only returns the rows in which the 'vendor' column is equal to 'fortinet', but I want the rows where the 'vendor' column equals 'fortinet', 'Fortinet', 'foRtinet', and so on.

2 answers

You can use a case-insensitive regular expression:

 val df = sc.parallelize(Seq(
   (1L, "Fortinet"),
   (2L, "foRtinet"),
   (3L, "foo")
 )).toDF("k", "v")

 df.where($"v".rlike("(?i)^fortinet$")).show
 // +---+--------+
 // |  k|       v|
 // +---+--------+
 // |  1|Fortinet|
 // |  2|foRtinet|
 // +---+--------+
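If the match value comes from a variable rather than a literal, it may be safer to quote it so regex metacharacters are treated literally. A minimal sketch, assuming the value arrives in a hypothetical vendor variable:

 import java.util.regex.Pattern

 val vendor = "fortinet" // hypothetical user-supplied value
 df.where($"v".rlike("(?i)^" + Pattern.quote(vendor) + "$")).show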

or simple equality with lower / upper:

 import org.apache.spark.sql.functions.{lower, upper}

 df.where(lower($"v") === "fortinet")
 // +---+--------+
 // |  k|       v|
 // +---+--------+
 // |  1|Fortinet|
 // |  2|foRtinet|
 // +---+--------+

 df.where(upper($"v") === "FORTINET")
 // +---+--------+
 // |  k|       v|
 // +---+--------+
 // |  1|Fortinet|
 // |  2|foRtinet|
 // +---+--------+

For simple filters I would prefer rlike, although performance should be about the same; for join conditions, equality is a much better choice. See "How can we JOIN two Spark SQL DataFrames using a SQL-esque 'LIKE' criterion?" for details. A sketch of such a case-insensitive equi-join follows below.
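For illustration, a case-insensitive equi-join could look like the sketch below; the second DataFrame other and its columns are hypothetical:

 import org.apache.spark.sql.functions.lower

 // hypothetical lookup table with a differently-cased vendor column
 val other = Seq((10L, "FORTINET"), (20L, "bar")).toDF("id", "vendor")

 // lowering both keys keeps this an equality join, so Spark can still plan it efficiently
 df.join(other, lower(df("v")) === lower(other("vendor")))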


Try using the lower / upper string functions:

 import org.apache.spark.sql.functions.{lower, upper}

 dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))

or

 dataFrame.filter(upper(dataFrame.col("vendor")).equalTo("FORTINET")) 
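The same lower-based filter can also be written as a plain SQL query against a temp view. This is only a sketch: it assumes a SparkSession named spark (Spark 2.x+) and reuses the dataFrame/vendor names from the question; the view name is made up.

 // "vendors" is a hypothetical view name
 dataFrame.createOrReplaceTempView("vendors")
 spark.sql("SELECT * FROM vendors WHERE lower(vendor) = 'fortinet'")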
