I believe that you cannot completely avoid the problem, but there is a simple trick that can reduce its scale. The idea is to replace all columns that should not be marginalized with a single placeholder.
For example, if you have a DataFrame:

    val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")
and you are interested in a cube marginalized over d and e and grouped by a..c, you can define a replacement for a..c as:
    import org.apache.spark.sql.functions.struct
    import sparkSql.implicits._  // sparkSql is the SQLContext (or SparkSession) in scope

    // Pack the always-grouped columns into a single struct placeholder.
    // Note: applying the alias here may not work in Spark 1.6.
    val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")
and cube:
val cubed = Seq($"d", $"e") // If there is a problem with aliasing rest it can done here. val tmp = df.cube(rest.alias("rest") +: cubed: _*).count
A quick filter and selection should handle the rest:
tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)
with a result like:
    +---+---+---+----+----+-----+
    |  a|  b|  c|   d|   e|count|
    +---+---+---+----+----+-----+
    |  1|  2|  3|null|   5|    1|
    |  1|  2|  3|null|null|    1|
    |  1|  2|  3|   4|   5|    1|
    |  1|  2|  3|   4|null|    1|
    +---+---+---+----+----+-----+
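If you need this for more than one set of columns, the whole trick can be wrapped in a small helper. This is only a sketch for illustration; partialCube and its signature are my own names, not anything provided by Spark:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, struct}

    // Hypothetical helper: cube only the `cubed` columns while keeping
    // the `grouped` columns fixed in every grouping set.
    def partialCube(df: DataFrame, grouped: Seq[String], cubed: Seq[String]): DataFrame = {
      val rest = struct(grouped.map(col): _*)
      val cubedCols = cubed.map(col)
      df.cube(rest.alias("rest") +: cubedCols: _*)
        .count()
        .where(col("rest").isNotNull)        // drop grouping sets that marginalize the placeholder
        .select(col("rest.*") +: cubedCols :+ col("count"): _*)  // unpack the struct again
    }

    // Usage, assuming the df defined above:
    // partialCube(df, Seq("a", "b", "c"), Seq("d", "e")).show()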