Python, PySpark: get the sum of the values in a PySpark DataFrame column

Say I have a DataFrame like this:

name age city
abc   20  A
def   30  B

I want to add a summary row at the end of the DataFrame, so the result looks like this:

name age city
abc   20  A
def   30  B
All   50  All

The string 'All' is easy to set, but how do I get sum(df['age'])? Calling it fails because a Column object is not iterable.

data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
data.printSchema()
#root
 #|-- name: string (nullable = true)
 #|-- age: long (nullable = true)
 #|-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All',sum(data['age']),'All')], data.columns))  ## TypeError: Column is not iterable
# I also tried data['age'].sum() and got an error. With a hard-coded row, [('All', 50, 'All')], it works fine.

I usually work with Pandas DataFrames and am new to Spark, so my understanding of Spark DataFrames may still be immature.

Please suggest how to get the sum of a DataFrame column in PySpark, and whether there is a better way to append a row to the end of a DataFrame. Thanks.

+4
2 answers

Spark SQL provides column functions in the pyspark.sql.functions module.
They can be used like this:

from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])

# union() replaces unionAll(), which has been deprecated since Spark 2.0
res = data.union(
    data.select([
        F.lit('All').alias('name'),    # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'),  # aggregate: the sum of 'age' over all rows
        F.lit('All').alias('city')     # create a column named 'city' filled with 'All'
    ]))
res.show()

+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20|   A|
| def| 30|   B|
| All| 50| All|
+----+---+----+
+11

Another way to get the sum of the age column is to go through the RDD: data.rdd.map(lambda x: float(x["age"])).reduce(lambda x, y: x+y)


+2

Source: https://habr.com/ru/post/1654665/

