Spark DataFrame column name case sensitivity

When I query DataFrame data in the spark-shell (Spark 1.6), the column names are case insensitive. In spark-shell:

    val a = sqlContext.read.parquet("<my-location>")
    a.filter($"name" <=> "andrew").count()
    a.filter($"NamE" <=> "andrew").count()

Both of the above give me the correct count. But when I build this into a jar and run it via spark-submit, the code below fails, saying that NamE does not exist, since the underlying Parquet data was saved with the column named "name".

Fails:

    a.filter($"NamE" <=> "andrew").count()

Passes:

    a.filter($"name" <=> "andrew").count()

Am I missing something here? Is there a way to make the filtering case insensitive? I know I could select with lowercase aliases for all columns before filtering, but I would like to know why it behaves differently.

+4
3 answers

The catch is that in spark-shell you are not actually working with a plain SQLContext. By default spark-shell creates a HiveContext, even though the variable is named sqlContext:

    scala> sqlContext.getClass
    res3: Class[_ <: org.apache.spark.sql.SQLContext] = class org.apache.spark.sql.hive.HiveContext

In the jar you run through spark-submit, you are presumably creating a plain SQLContext. To quote @LostInOverflow's answer: Hive is case insensitive, while Parquet is not, so: a HiveContext resolves column names the way Hive does, case insensitively, and matches the Parquet column "name" no matter how you spell it. A plain SQLContext is case sensitive by default, which is why the same filter fails under spark-submit.
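
If you want the spark-shell behavior under spark-submit, one option is to create a HiveContext in the application itself, just as spark-shell does. A minimal sketch, assuming Spark 1.6 with the spark-hive dependency on the classpath (the object name and app name are invented for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object CaseInsensitiveApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("case-insensitive-demo"))
        // A HiveContext resolves column names case insensitively by default,
        // matching what spark-shell gives you out of the box.
        val sqlContext = new HiveContext(sc)
        import sqlContext.implicits._

        val a = sqlContext.read.parquet("<my-location>")
        // Resolves against the Parquet column "name" despite the odd casing.
        println(a.filter($"NamE" <=> "andrew").count())
      }
    }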

+6

Parquet is case sensitive when it stores and returns column information, while Hive is not. To quote:

    ... Hive is case insensitive, while Parquet is not, ...

You can try controlling the case yourself and lowercase all the column names:

    val b = df.toDF(df.columns.map(_.toLowerCase): _*)
    b.filter(...)
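
As a made-up illustration of the effect (the sample data and column names here are hypothetical):

    import sqlContext.implicits._

    // Start with a mixed-case column name, like the Parquet file in the question.
    val df = sqlContext.createDataFrame(Seq(("andrew", 1))).toDF("NamE", "Id")
    df.columns                                          // Array(NamE, Id)

    // Rebuild the DataFrame with every column name lowercased.
    val b = df.toDF(df.columns.map(_.toLowerCase): _*)
    b.columns                                           // Array(name, id)
    b.filter($"name" <=> "andrew").count()              // 1
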
+6

Try explicitly setting case sensitivity on the sqlContext. Turn it off with the statement below and see if that helps:

    sqlContext.sql("set spark.sql.caseSensitive=false")
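
If you would rather not go through SQL, the same property can be set programmatically on the context (a sketch, assuming a Spark 1.x SQLContext):

    // Same effect as the SET statement above: make column resolution ignore case.
    sqlContext.setConf("spark.sql.caseSensitive", "false")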

+1