How to set display precision in PySpark DataFrame show

How can I set the display precision in PySpark when calling .show()?

Consider the following example:

from math import sqrt
import pyspark.sql.functions as f

data = zip(
    map(lambda x: sqrt(x), range(100, 105)),
    map(lambda x: sqrt(x), range(200, 205))
)
# sqlCtx is an existing SQLContext; with a SparkSession, spark.createDataFrame works the same way
df = sqlCtx.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()

Which outputs:

#+------------------+------------------+
#|              col1|              col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+

How can I change it so that it only displays 3 digits after the decimal point?

Desired output:

#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

This is the PySpark version of this Scala question. I am posting it here because I could not find the answer when searching for PySpark solutions, and I think it may be useful to others in the future.

1 answer

Round

The easiest option is to use pyspark.sql.functions.round():

# Note: this import shadows Python's built-in round() in the current namespace
from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

This keeps the values as numeric types.

Format number

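The functions are the same for Scala and Python; the only difference is the import.

You can use format_number to format a number to the desired number of decimal places, as stated in the official API documentation:

Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.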

from pyspark.sql.functions import avg, format_number 
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

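Note that the transformed columns are of StringType, and a comma is used as the thousands separator. With larger values the output looks like this: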

#+-----------+--------------+
#|       col1|          col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
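
As a rough plain-Python analogy (this is not Spark code; the variable names and the ",.3f" format spec are illustrative and belong to Python, not to the PySpark API), the difference between rounding a number and formatting it as a string can be sketched like this:

```python
from math import sqrt

# Same quantity as col1 in the question: the mean of sqrt(100) .. sqrt(104)
avg_col1 = sum(sqrt(x) for x in range(100, 105)) / 5

# Rounding keeps a numeric value, comparable to what round() does on a column
rounded = round(avg_col1, 3)

# Formatting produces a string, comparable to what format_number() returns
formatted = format(avg_col1, ",.3f")

print(type(rounded).__name__, rounded)
print(type(formatted).__name__, formatted)

# With larger values the thousands separator appears in the string
big = format(500100.0, ",.3f")
print(big)  # 500,100.000
```

The rounded value can still be used in arithmetic or sorted numerically, while the formatted one is text and sorts lexicographically.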

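As in the Scala version of this answer, regexp_replace can be used to replace the , with any string you want. From the API documentation:

Replace all substrings of the specified string value that match regexp with rep.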

from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#|      col1|        col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
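
One reason to strip the separators is that the comma-free string can be parsed back into a number. The following is a minimal plain-Python sketch of that idea (an analogy only, not Spark code; str.replace stands in for regexp_replace here, and the sample value is taken from the output above):

```python
# A formatted value with thousands separators, as a string
formatted = format(50489590.0, ",.3f")

# Removing the commas leaves a plain decimal string
plain = formatted.replace(",", "")
print(formatted, "->", plain)

# The comma-free string parses back to a float
value = float(plain)
print(value)

# The string that still contains commas does not parse
try:
    float(formatted)
    parsed_with_commas = True
except ValueError:
    parsed_with_commas = False
print("parses with commas:", parsed_with_commas)
```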

Source: https://habr.com/ru/post/1681095/

