I have been experimenting with different ways of filtering a typed Dataset, and it turns out the performance can differ dramatically.
The Dataset comes from a 1.6 GB CSV file with 33 columns and 4,226,047 rows; it is created by loading the CSV data and mapping it to a case class:
val df = spark.read.csv(csvFile).as[FireIncident]
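For context, a minimal sketch of the setup I am assuming here: only the UnitID field appears in my code below, so everything else (the header option, the Option[String] type, the omitted columns) is just illustrative, not the actual class.
// Hypothetical case class: only UnitID is used in the snippets below;
// the remaining 32 columns are omitted for brevity.
case class FireIncident(UnitID: Option[String] /*, ... 32 more columns */)
import spark.implicits._  // provides the Encoder required by .as[FireIncident]
val df = spark.read
  .option("header", "true")  // assumption: the CSV has a header row
  .csv(csvFile)
  .as[FireIncident]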
A filter on UnitID = 'B02' should return 47,980 rows. I tested three approaches, shown below: 1) Using a typed column (~500 ms running locally)
df.where($"UnitID" === "B02").count()
2) Using a temporary view and a SQL query (about the same as option 1)
df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT * FROM FireIncidentsSF WHERE UnitID='B02'").count()
3) Using a field of the strongly typed case class (14,987 ms, i.e. about 30 times slower)
df.filter(_.UnitID.orNull == "B02").count()
I tested again using the Python API on the same dataset; it took 17,046 ms, which is comparable to approach 3 of the Scala API.
df.filter(df['UnitID'] == 'B02').count()
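To see what Spark actually plans in each case, the physical plans can be compared with explain(); the comments below are my reading of what shows up, not a definitive explanation.
// Option 1: the predicate is a Catalyst expression, so it can be optimized
// and pushed down.
df.where($"UnitID" === "B02").explain()
// Option 3: the predicate is an arbitrary Scala lambda that Catalyst cannot
// inspect, so the plan typically contains a deserialize-to-object step plus
// a filter over FireIncident objects instead of a pushed-down predicate.
df.filter(_.UnitID.orNull == "B02").explain()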
Can someone shed some light on why 3) and the Python API are executed differently from the first two approaches?