In a Python notebook on the Databricks Community Edition, I explore the city of San Francisco by analyzing 911 emergency calls to the fire department. (An old copy of the data from 2016 was used in “Using Apache Spark 2.0 to Analyze the City of San Francisco Open Data” (YouTube) and made available on S3 for that tutorial.)
After loading the data and reading it with an explicitly defined schema into the DataFrame fire_service_calls_df, I registered the DataFrame as a SQL table:
sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")
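For context, the load step might have looked roughly like the following minimal sketch. The S3 path is a placeholder and the schema is truncated to three columns for illustration; the real dataset has many more.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Truncated schema for illustration only; the full dataset has far more columns.
fire_schema = StructType([
    StructField('CallNumber', IntegerType(), True),
    StructField('CallType', StringType(), True),
    StructField('CallDate', StringType(), True),
    # ... remaining columns omitted
])

fire_service_calls_df = spark.read.csv(
    's3://some-bucket/Fire_Department_Calls_for_Service.csv',  # assumed path
    header=True,
    schema=fire_schema,
)

Note that registerDataFrameAsTable belongs to the pre-2.0 SQLContext API; on Spark 2.x and later the same effect is achieved with fire_service_calls_df.createOrReplaceTempView('fireServiceCalls').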
With that in place, I can count how many different types of calls occurred, first via the DataFrame API:
fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34
... or with SQL in Python:
spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
| 33|
+------------------------+
... or using an SQL cell:
%sql
SELECT count(DISTINCT CallType)
FROM fireServiceCalls

Did you spot it? (Once 34, once 33.) SQL's count(DISTINCT CallType) ignores NULL values, while the DataFrame API's distinct() treats NULL as a value of its own.
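To make the culprit visible, a quick check helps (a sketch, assuming the CallType column really does contain a NULL):

from pyspark.sql.functions import col

# List all distinct values; one of the 34 rows should be NULL.
fire_service_calls_df.select('CallType').distinct().show(40, truncate=False)

# Rows with a NULL CallType explain the off-by-one between the two counts.
fire_service_calls_df.filter(col('CallType').isNull()).count()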