How to use a “cube” only for certain fields in a Spark data frame?

I am using Spark 1.6.1 and I have the following data frame:

    +-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
    |     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|
    +-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
    |    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|
    ......

And I want to get a data frame like the one below using a cube operation.

(Grouped by all fields, but with the cube applied only to the fields "os_name", "country" and "app_ver".)

    +-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
    |     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|cnt|
    +-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
    |    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|  9|
    |    test_home|scene_enter|        test_home|   null|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test| 35|
    |    test_home|scene_enter|        test_home|android|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test| 98|
    |    test_home|scene_enter|        test_home|android|     KR|   null|__OTHERS__|  false|   test|   test|   test|101|
    |    test_home|scene_enter|        test_home|   null|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test|301|
    |    test_home|scene_enter|        test_home|   null|     KR|   null|__OTHERS__|  false|   test|   test|   test|225|
    |    test_home|scene_enter|        test_home|android|   null|   null|__OTHERS__|  false|   test|   test|   test|312|
    |    test_home|scene_enter|        test_home|   null|   null|   null|__OTHERS__|  false|   test|   test|   test|521|
    ......

I tried the following, but it seems slow and ugly:

    var cubed = df
      .cube($"scene_id", $"action_id", $"classifier", $"country", $"os_name", $"app_ver",
            $"p0value", $"p1value", $"p2value", $"p3value", $"p4value")
      .count
      .where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND " +
             "p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND " +
             "p3value IS NOT NULL AND p4value IS NOT NULL")

Are there any better solutions?

1 answer

I believe you cannot avoid the problem completely, but there is a simple trick you can use to reduce its scale. The idea is to replace all the columns that should not be cubed with a single placeholder.

For example, if you have a DataFrame:

    val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")

and you want the cube applied only to d and e while grouping by a..c, you can define the placeholder for a..c as:

    import org.apache.spark.sql.functions.struct
    import sparkSql.implicits._

    // alias here may not work in Spark 1.6
    val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")

and build the cube:

    val cubed = Seq($"d", $"e")

    // If there is a problem with aliasing rest, it can be done here.
    val tmp = df.cube(rest.alias("rest") +: cubed: _*).count

Because rest is a single struct column, cube marginalizes it as a whole: the subtotal rows come back with rest set to null. A quick filter and select should handle the rest:

    tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)

with a result like:

    +---+---+---+----+----+-----+
    |  a|  b|  c|   d|   e|count|
    +---+---+---+----+----+-----+
    |  1|  2|  3|null|   5|    1|
    |  1|  2|  3|null|null|    1|
    |  1|  2|  3|   4|   5|    1|
    |  1|  2|  3|   4|null|    1|
    +---+---+---+----+----+-----+
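
For reference, here is a minimal sketch of the same trick applied to the DataFrame from the question. It assumes df holds the data shown there and that the same implicits import as above is in scope for the $ syntax; the names rest, cubed and result are only illustrative, and the count column is renamed to cnt to match the desired output:

    import org.apache.spark.sql.functions.struct

    // Columns that should only be grouped, never marginalized, go into the placeholder struct.
    val rest = struct(Seq(
      $"scene_id", $"action_id", $"classifier",
      $"p0value", $"p1value", $"p2value", $"p3value", $"p4value"
    ): _*).alias("rest")

    // Columns that should actually be cubed.
    val cubed = Seq($"os_name", $"country", $"app_ver")

    val result = df
      .cube(rest.alias("rest") +: cubed: _*)
      .count
      .where($"rest".isNotNull)                                // drop rows where the placeholder was marginalized
      .select($"rest.*" +: cubed :+ $"count".alias("cnt"): _*) // unpack the placeholder and rename count to cnt

Note that the cubed columns end up after the grouped ones, so the column order differs from the example in the question; another select can reorder them if that matters.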

Source: https://habr.com/ru/post/1012582/

