One option is to use window functions. First, define a window partitioned by _1 that spans all rows of each group:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("_1").rowsBetween(Long.MinValue, Long.MaxValue)
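In newer Spark versions the same unbounded frame can also be written with the named boundary constants instead of the raw Long values:

// Same window as above, expressed with Window's named frame boundaries
val w = Window.partitionBy("_1")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)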
Using it, compute each value's probability within its group:
import org.apache.spark.sql.functions.sum
val p = $"_2" / sum($"_2").over(w)
val withP = df.withColumn("p", p)
Then group by _1 and aggregate with the Shannon entropy formula H = -sum(p * log2(p)):
import org.apache.spark.sql.functions.log2
withP.groupBy($"_1").agg((-sum($"p" * log2($"p"))).alias("entropy"))
For example, with the following data (assuming spark.implicits._ is in scope, as in spark-shell):

val df = Seq(
  (0, 13), (0, 7), (0, 3), (0, 1), (0, 1), (1, 4), (1, 8), (1, 18), (1, 4)).toDF

the result is:
+---+------------------+
| _1| entropy|
+---+------------------+
| 1|1.7033848993102918|
| 0|1.7433726580786888|
+---+------------------+
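As a quick sanity check, the same formula can be evaluated in plain Scala for group 0 (values 13, 7, 3, 1, 1); this is only a sketch of the arithmetic behind the Spark job:

// Shannon entropy H = -sum(p * log2(p)) for group 0, with p = value / total
val xs = Seq(13.0, 7.0, 3.0, 1.0, 1.0)
val total = xs.sum
val entropy = -xs.map { x =>
  val p = x / total
  p * (math.log(p) / math.log(2)) // log2(p)
}.sum
// entropy ≈ 1.7434, matching the row for _1 = 0 above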
The same result can be obtained without window functions, using an aggregation followed by a join:
df.groupBy($"_1").agg(sum("_2").alias("total"))
.join(df, Seq("_1"), "inner")
.withColumn("p", $"_2" / $"total")
.groupBy($"_1").agg((-sum($"p" * log2($"p"))).alias("entropy"))
Here

df.groupBy($"_1").agg(sum("_2").alias("total"))

computes the total of _2 for each _1,

.join(df, Seq("_1"), "inner")

joins the totals back to the original rows,

.withColumn("p", $"_2" / $"total")

computes each value's share of its group total, and

.groupBy($"_1").agg((-sum($"p" * log2($"p"))).alias("entropy"))

aggregates the entropy.
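If the column names are not fixed, the join-based variant can be wrapped in a small helper; the name entropyPerGroup and its parameters are just an illustration, not part of the original code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, log2, sum}

// Hypothetical helper generalizing the join-based computation to arbitrary columns
def entropyPerGroup(data: DataFrame, keyCol: String, valueCol: String): DataFrame =
  data.groupBy(col(keyCol)).agg(sum(col(valueCol)).alias("total"))
    .join(data, Seq(keyCol), "inner")
    .withColumn("p", col(valueCol) / col("total"))
    .groupBy(col(keyCol))
    .agg((-sum(col("p") * log2(col("p")))).alias("entropy"))

// entropyPerGroup(df, "_1", "_2") gives the same result as above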