> df.cube($"x", $"y").count.show // +----+----+-----+ // | x| y|count| // +----+----+-----+ // |null| 1| 1| <- count of records where y = 1 // |null| 2| 3| <- count of records where y = 2 // | foo|null| 2| <- count of records where x = foo // | bar| 2| 2| <- count of records where x = bar AND y = 2 // | foo| 1| 1| <- count of records where x = foo AND y = 1 // | foo| 2| 1| <- count of records where x = foo AND y = 2 // |null|null| 4| <- total count of records // | bar|null| 2| <- count of records where x = bar // +----+----+-----+
rollup 与 cube 相比,没有了 x=null 的情况,也就是没有单独统计 y 的数量(有点类似 left join ?)
1 2 3 4 5 6 7 8 9 10 11
> df.rollup($"x", $"y").count.show +----+----+-----+ | x| y|count| +----+----+-----+ | foo|null| 2| <- count where x is fixed to foo | bar| 2| 2| <- count where x is fixed to bar and y is fixed to 2 | foo| 1| 1| ... | foo| 2| 1| ... |null|null| 4| <- count where no column is fixed | bar|null| 2| <- count where x is fixed to bar +----+----+-----+
groupBy 和我们熟悉的 SQL 里面的 group by 是一样的。
1 2 3 4 5 6 7 8 9
df.groupBy($"x", $"y").count.show
// +---+---+-----+ // | x| y|count| // +---+---+-----+ // |foo| 1| 1| <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP // |foo| 2| 1| <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP // |bar| 2| 2| <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP // +---+---+-----+