Divide the elements of a column by the sum of that column, grouped by the values of another column

I am working on a Spark application and am trying to transform a data frame like the one shown in Table 1. I want to divide each element of column _2 by the sum of the elements of that same column, grouped by the values of another column (_1). Table 2 shows the expected result.

Table 1

+---+---+
| _1| _2|
+---+---+
|  0| 13|
|  0|  7|
|  0|  3|
|  0|  1|
|  0|  1|
|  1|  4|
|  1|  8|
|  1| 18|
|  1|  4|
+---+---+

Table 2

+---+----+
| _1| _2 |
+---+----+
|  0|13/x|
|  0| 7/x|
|  0| 3/x|
|  0| 1/x|
|  0| 1/x|
|  1| 4/y|
|  1| 8/y|
|  1|18/y|
|  1| 4/y|
+---+----+

where x = (13 + 7 + 3 + 1 + 1) and y = (4 + 8 + 18 + 4)
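Worked out for the sample data, the two group totals and one normalised value from each group are as follows (the rounding of the second fraction is only illustrative):

```latex
\begin{aligned}
x &= 13 + 7 + 3 + 1 + 1 = 25, &\qquad \tfrac{13}{x} &= 0.52,\\
y &= 4 + 8 + 18 + 4 = 34,     &\qquad \tfrac{4}{y}  &= \tfrac{2}{17} \approx 0.1176.
\end{aligned}
```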

Then I want to calculate the entropy for each value in column _1, i.e. for each value in column _1, compute sum(p_i * log(p_i)) over column _2, where the p_i are the values in column _2 of Table 2 that belong to that value of _1.
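Written out as a formula, and assuming the usual Shannon convention (a leading minus sign and a base-2 logarithm, neither of which the description above states explicitly), the quantity for a group g with raw values v_1, ..., v_n in column _2 would be:

```latex
p_i = \frac{v_i}{\sum_{j=1}^{n} v_j},
\qquad
H(g) = -\sum_{i=1}^{n} p_i \log_2 p_i .
```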

The final result should look like this:

+---+---------+
| _1| ENTROPY |
+---+---------+
|  0|entropy_1|
|  1|entropy_2|
+---+---------+

How can I do this in Spark (preferably in Scala)?


One option is to use window functions. First, define a window:

import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("_1").rowsBetween(Long.MinValue, Long.MaxValue)

Then compute the probabilities:

import org.apache.spark.sql.functions.sum

val p = $"_2" / sum($"_2").over(w)
val withP = df.withColumn("p", p)

Finally, aggregate to get the entropy:

import org.apache.spark.sql.functions.log2

withP.groupBy($"_1").agg((-sum($"p" * log2($"p"))).alias("entropy"))

val df = Seq(
  (0, 13), (0, 7), (0, 3), (0, 1), (0, 1), (1, 4), (1, 8), (1, 18), (1, 4)).toDF

For this example data, the result is:

+---+------------------+
| _1|           entropy|
+---+------------------+
|  1|1.7033848993102918|
|  0|1.7433726580786888|
+---+------------------+
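As a sanity check that needs no Spark at all, the same numbers can be reproduced with plain Scala collections. This is only a sketch: the names `EntropyCheck`, `entropy` and `entropyByKey` are made up for illustration, and skipping zero counts is an assumption (the convention that 0 * log 0 = 0), not something taken from the Spark code.

```scala
object EntropyCheck {
  // Shannon entropy (base 2) of the distribution obtained by
  // dividing each count by the sum of all counts in the group.
  def entropy(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    counts
      .filter(_ > 0)                              // assumption: treat 0 * log(0) as 0
      .map(_ / total)                             // p_i = v_i / sum(v)
      .map(p => -p * (math.log(p) / math.log(2))) // -p_i * log2(p_i)
      .sum
  }

  // Group (key, count) pairs by key and compute the entropy of each group.
  def entropyByKey(rows: Seq[(Int, Int)]): Map[Int, Double] =
    rows.groupBy(_._1).map { case (k, vs) => k -> entropy(vs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    // Same sample data as in the question.
    val rows = Seq(
      (0, 13), (0, 7), (0, 3), (0, 1), (0, 1),
      (1, 4), (1, 8), (1, 18), (1, 4))
    entropyByKey(rows).toSeq.sortBy(_._1).foreach {
      case (k, h) => println(s"$k -> $h")
    }
  }
}
```

Up to floating-point rounding, the two values should agree with the entropy column in the table above.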

Alternatively, you can aggregate and join:

df.groupBy($"_1").agg(sum("_2").alias("total"))
  .join(df, Seq("_1"), "inner")
  .withColumn("p", $"_2" / $"total")
  .groupBy($"_1").agg((-sum($"p" * log2($"p"))).alias("entropy"))

where:

df.groupBy($"_1").agg(sum("_2").alias("total"))

computes the sum of _2 for each value of _1,

_.join(df, Seq("_1"), "inner")

joins the totals back onto the original rows,

_.withColumn("p", $"_2" / $"total")

computes the probabilities, and:

_.groupBy($"_1").agg((-sum($"p" * log2($"p"))).alias("entropy"))

aggregates to obtain the entropy.


Source: https://habr.com/ru/post/1657570/

