Spark SQL case-insensitive filter for column conditions

How do I make a Spark SQL filter case-insensitive?

For instance:

dataFrame.filter(dataFrame.col("vendor").equalTo("fortinet")); 

only returns the rows in which the 'vendor' column is equal to 'fortinet', but I want the rows where the 'vendor' column equals 'fortinet', 'Fortinet', 'foRtinet', and so on.

2 answers

You can use a case-insensitive regular expression:

 val df = sc.parallelize(Seq(
   (1L, "Fortinet"),
   (2L, "foRtinet"),
   (3L, "foo")
 )).toDF("k", "v")

 df.where($"v".rlike("(?i)^fortinet$")).show
 // +---+--------+
 // |  k|       v|
 // +---+--------+
 // |  1|Fortinet|
 // |  2|foRtinet|
 // +---+--------+
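If the match value comes from a variable rather than a literal, it may be safer to quote it so regex metacharacters are treated literally. A minimal sketch, assuming the value arrives in a hypothetical vendor variable:

 import java.util.regex.Pattern

 val vendor = "fortinet" // hypothetical user-supplied value
 df.where($"v".rlike("(?i)^" + Pattern.quote(vendor) + "$")).show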

or simple equality with lower / upper:

 import org.apache.spark.sql.functions.{lower, upper}

 df.where(lower($"v") === "fortinet")
 // +---+--------+
 // |  k|       v|
 // +---+--------+
 // |  1|Fortinet|
 // |  2|foRtinet|
 // +---+--------+

 df.where(upper($"v") === "FORTINET")
 // +---+--------+
 // |  k|       v|
 // +---+--------+
 // |  1|Fortinet|
 // |  2|foRtinet|
 // +---+--------+

For simple filters I would prefer rlike, although performance should be about the same; for join conditions, equality is a much better choice. See "How can we JOIN two Spark SQL DataFrames using a SQL-esque 'LIKE' criterion?" for details. A sketch of such a case-insensitive equi-join follows below.
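For illustration, a case-insensitive equi-join could look like the sketch below; the second DataFrame other and its columns are hypothetical:

 import org.apache.spark.sql.functions.lower

 // hypothetical lookup table with a differently-cased vendor column
 val other = Seq((10L, "FORTINET"), (20L, "bar")).toDF("id", "vendor")

 // lowering both keys keeps this an equality join, so Spark can still plan it efficiently
 df.join(other, lower(df("v")) === lower(other("vendor")))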


Try using the lower / upper string functions:

 import org.apache.spark.sql.functions.{lower, upper}

 dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))

or

 dataFrame.filter(upper(dataFrame.col("vendor")).equalTo("FORTINET")) 
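The same lower-based filter can also be written as a plain SQL query against a temp view. This is only a sketch: it assumes a SparkSession named spark (Spark 2.x+) and reuses the dataFrame/vendor names from the question; the view name is made up.

 // "vendors" is a hypothetical view name
 dataFrame.createOrReplaceTempView("vendors")
 spark.sql("SELECT * FROM vendors WHERE lower(vendor) = 'fortinet'")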
