Seq.contains in SQL on a Spark DataFrame

I have the following data structure:

  • id: int
  • records: Seq[String]
  • other: boolean

Here is the same data as JSON strings, to make testing easier:

val data = sc.makeRDD(Seq(
  """{"id": 1, "records": ["one", "two", "three"], "other": true}""",
  """{"id": 2, "records": ["two"], "other": true}""",
  """{"id": 3, "records": ["one"], "other": false}"""))
sqlContext.jsonRDD(data).registerTempTable("temp")

And I would like to keep only the rows whose records field contains "one" and whose other field equals true, using only SQL.

I can do this with filter (see below), but is it possible to do it using SQL alone?

sqlContext
    .sql("select id, records from temp where other = true")
    .rdd.filter(t => t.getAs[Seq[String]]("records").contains("one"))
    .collect()
1 answer

Spark SQL supports a large number of Hive functions, so you can use array_contains to do this entirely in SQL:

spark.sql("select id, records from temp where other = true and array_contains(records,'one')").show
# +---+-----------------+
# | id|          records|
# +---+-----------------+
# |  1|[one, two, three]|
# +---+-----------------+
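For completeness, the same predicate can also be expressed with the DataFrame API instead of a SQL string. A minimal sketch, assuming the temp table registered in the question (the df name is ours):

import org.apache.spark.sql.functions.array_contains

// Load the registered temp table and apply the same filter via Column expressions.
val df = sqlContext.table("temp")
df.filter(df("other") && array_contains(df("records"), "one"))
  .select("id", "records")
  .show()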

Note: as of Spark 1.5, sqlContext.jsonRDD is deprecated; use the DataFrameReader API instead:

sqlContext.read.format("json").json(data).registerTempTable("temp")